When trying to clone a Cloud SQL for SQL Server instance that contains 3 databases, an error reporting that the .mdf and .ldf files cannot be found is logged for 2 of the 3 databases. The failure to locate these files causes the clone to fail (even though the third database appears to clone successfully).
Below is an excerpt from the logs. I've searched online, but the solutions I found all involved modifying permissions on the folder containing the .mdf and .ldf files (which I can't do).
I'm hoping to solicit some ideas or (ideally!) a solution to this issue. Not having PITR is a fairly significant problem.
EDIT: Looks like this was a red herring and the real issue is how corrupted transaction logs affect PITR. For details: https://www.googlecloudcommunity.com/gc/Databases/Cloud-SQL-SQL-Server-Clone-Failure-FCB-Open-Failed...
{
  "textPayload": "\r2023-12-20 17:48:36.45 spid32s FCB::Open failed: Could not open file /var/opt/mssql/data/ef79e9d7-75de-42a2-bd91-a2ee9df47e14_{MY_DB}.mdf for file number 1. OS error: 2(The system cannot find the file specified.).\r",
  "insertId": "s=49467526bd244e908506623f1a0daed8;i=7cd2;b=5d1a0b44f68b4d9987082780f4b8ac48;m=790a2801;t=60cf99f1f8485;x=1cb654f33a0e15fe-0@ah",
  "resource": {
    "type": "cloudsql_database",
    "labels": {
      "database_id": "{MY_PROJECT}:{CLONED_INSTANCE}",
      "region": "northamerica-northeast1",
      "project_id": "{MY_PROJECT}"
    }
  },
  "timestamp": "2023-12-20T23:48:36.469366Z",
  "severity": "INFO",
  "labels": {
    "_COMM": "dockerd",
    "_GID": "0",
    "_SYSTEMD_UNIT": "docker.service",
    "_BOOT_ID": "5d1a0b44f68b4d9987082780f4b8ac48",
    "_HOSTNAME": "a-1001740856657-regional-vm-ff7e4077e83fbb96-625b",
    "CONTAINER_ID_FULL": "cceb31f52c299697d8e926119521bf7cff2051dab13144bb3496f38a69ac2150",
    "INSTANCE_ID": "8067988355191589818",
    "_TRANSPORT": "journal",
    "_PID": "1016",
    "LOG_BUCKET_NUM": "45",
    "_MACHINE_ID": "de6ef28c724263dd3ed9549364ea8828",
    "_SOURCE_REALTIME_TIMESTAMP": "1703116116469352",
    "_SYSTEMD_INVOCATION_ID": "9fb3844a3be84327a93f2f621d1fd5ec",
    "CONTAINER_ID": "cceb31f52c29",
    "_CAP_EFFECTIVE": "1ffffffffff",
    "INSTANCE_UID": "21-e19b6200-ee00-4b4d-a1f1-763f6de04f4d",
    "_SYSTEMD_SLICE": "system.slice",
    "__REALTIME_TIMESTAMP": "1703116116469366",
    "SYSLOG_IDENTIFIER": "cceb31f52c29",
    "IMAGE_NAME": "gcr.io/speckle-umbrella/sqlserver2019_service:speckle_service_agents_20231029.00_RC02",
    "__MONOTONIC_TIMESTAMP": "2030881266",
    "CONTAINER_NAME": "sql_server",
    "_UID": "0",
    "_CMDLINE": "/usr/bin/dockerd --registry-mirror=https://mirror.gcr.io --host=fd:// --containerd=/var/run/containerd/containerd.sock",
    "_SYSTEMD_CGROUP": "/system.slice/docker.service",
    "CONTAINER_TAG": "cceb31f52c29",
    "_EXE": "/usr/bin/dockerd"
  },
  "logName": "projects/{MY_PROJECT}/logs/cloudsql.googleapis.com%2Fsqlserver.err",
  "receiveTimestamp": "2023-12-20T23:48:37.654310637Z"
}
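In case it helps anyone triaging similar entries, the interesting bits of the textPayload can be pulled out with a few lines of Python (the payload below is copied from the entry above, keeping the original {MY_DB} placeholder):

```python
import re

# Payload copied from the log entry above ({MY_DB} is the same
# placeholder used in the original post).
text_payload = (
    "\r2023-12-20 17:48:36.45 spid32s FCB::Open failed: Could not open file "
    "/var/opt/mssql/data/ef79e9d7-75de-42a2-bd91-a2ee9df47e14_{MY_DB}.mdf "
    "for file number 1. OS error: 2(The system cannot find the file specified.).\r"
)

# Pull out the missing file path and the OS error code.
match = re.search(
    r"Could not open file (?P<path>\S+) for file number \d+\. "
    r"OS error: (?P<code>\d+)",
    text_payload,
)
print(match.group("path"))
print(match.group("code"))  # 2, i.e. "The system cannot find the file specified"
```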
The error message indicates that the .mdf and .ldf files for 2 of the databases cannot be found, which is preventing the clone from completing successfully.
Here are some potential causes and solutions worth investigating for this issue:
- Incorrect source database path
- Permissions issue
- Corrupted database files
- Incomplete replication
- Google Cloud SQL limitations and known issues
- Contact Google Cloud Support
- Logs and monitoring
- Alternative strategies
For a more accurate diagnosis, please provide the following information:
Thanks for the response. I'm fairly certain the issue I'm facing is a bug with Cloud SQL and have contacted support, but that's been slow moving. Maybe posting in this channel will expedite that process.
For those who don't want to read the exhaustive explanation a little further down: PITR appears to do a restore from the most recent full backup (prior to the chosen point in time) and re-apply transaction logs up until the desired point in time. If a transaction log is corrupted at any point after the full backup and before the chosen point in time, PITR will fail.
Assuming that's true, there are a couple of problems:
Hopefully I get a meaningful response from support, or someone in the forums can point out an issue with my testing and help clarify how Cloud SQL is handling things.
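To make my working mental model of PITR concrete, here's a toy simulation (purely illustrative — this is my assumption about the behavior, not Cloud SQL internals):

```python
from dataclasses import dataclass

@dataclass
class LogBackup:
    taken_at: int          # minutes after the full backup (toy timeline)
    corrupted: bool = False

def point_in_time_restore(logs, target_time):
    """Replay log backups in order up to target_time; hitting a
    corrupted log anywhere along the way aborts the whole restore."""
    for log in sorted(logs, key=lambda l: l.taken_at):
        if log.taken_at > target_time:
            break
        if log.corrupted:
            raise RuntimeError(f"cannot redo log taken at t={log.taken_at}")
    return f"restored to t={target_time}"

logs = [LogBackup(10), LogBackup(20, corrupted=True), LogBackup(30)]
print(point_in_time_restore(logs, 15))  # succeeds: target precedes the bad log
# point_in_time_restore(logs, 25) raises: the corrupted t=20 log would have
# to be replayed to reach t=25, so every later target time is unreachable.
```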
For full details on how I came to this conclusion, read the below.
I have three databases:
When cloning databases individually, DB_A and DB_B succeed (despite the .mdf file not found error) and it's only DB_C that fails.
Reason? The error I originally reported is probably a red herring and can be ignored (although it's pretty confusing to see it show up in the logs even when the clone is successful).
I dug through some logs and found that an error occurred during the restoration of a log in DB_C.
Could not redo log record (496884:85120:3), for transaction ID (0:1129725227), on page (1:9), allocation unit 6488064, database 'DB_C' (database ID 5). Page: LSN = (496884:85120:2), allocation unit = 6488064, type = 13. Log: OpCode = 4, context 20, PrevPageLSN: (496884:85088:6). Restore from a backup of the database, or repair the database.
Given that neither DB_A nor DB_B reports an error of this nature, I'm tempted to think this error is the real cause of the clone failure.
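As a side note, if you're combing an exported error log for this, the redo errors can be grouped by database with a quick script (the sample lines below are fabricated to mirror the one above):

```python
import re
from collections import Counter

# Fabricated log lines shaped like the SQL Server messages above.
lines = [
    "Could not redo log record (496884:85120:3), for transaction ID "
    "(0:1129725227), on page (1:9), allocation unit 6488064, database 'DB_C' "
    "(database ID 5).",
    "Backup Database backed up. Database: DB_A.",
]

# Count redo failures per database named in the error message.
pattern = re.compile(r"Could not redo log record .*? database '(?P<db>[^']+)'")
redo_errors = Counter(
    m.group("db") for line in lines if (m := pattern.search(line))
)
print(dict(redo_errors))  # {'DB_C': 1}
```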
To try and resolve this issue:
To keep some names clear:
This is some bizarre behavior. Maybe it's because I chose a PITR time that occurred after "created_from_og_backup" was created, but before the backup from "og" was restored? To test this, I took a look at the backup history and operations history of "created_from_og_backup".
Backup history
Created                  | Type      | Location         | Description
-------------------------|-----------|------------------|----------------------------------
Dec 21, 2023, 1:32:15 PM | Automated | Multi-region: us | Backup for SQL Server PITR setup
Dec 21, 2023, 1:30:43 PM | Automated | Multi-region: us | Backup for SQL Server PITR setup
Operations history
Creation Time            | Completion Time          | Type                         | Status
-------------------------|--------------------------|------------------------------|--------------------------------------
Dec 21, 2023, 2:08:32 PM | Dec 21, 2023, 2:08:47 PM | Update                       | Update finished
Dec 21, 2023, 1:39:47 PM | Dec 21, 2023, 1:48:42 PM | Restore instance from backup | Restore instance from backup finished
Dec 21, 2023, 1:32:15 PM | Dec 21, 2023, 1:33:46 PM | Backup                       | Backup finished
Dec 21, 2023, 1:30:43 PM | Dec 21, 2023, 1:32:15 PM | Backup                       | Backup finished
Dec 21, 2023, 1:24:37 PM | Dec 21, 2023, 1:35:37 PM | Create                       | Create finished
Ok, so if I do a PITR with a point in time after "Dec 21, 2023, 1:39:47 PM", it should contain all of the databases and all my data, right?
No! None of my data was available. The only databases available were master, model, tempdb, and msdb. This is very unexpected behavior and is almost certainly a bug.
Ok, ok. What if I take a new, on-demand backup of "created_from_og_backup", wait 10 minutes for some logs to be created and backed up, then try another PITR?
the backup
2023-12-21 17:04:44.10 Backup Database backed up. Database: DB_C, creation date(time): 2023/12/11(11:14:29), pages dumped: 16362819, first LSN: 496889:160560:40, last LSN: 496889:160600:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {'15e8ea9178786b60'}). This is an informational message only. No user action is required.
the transaction log backup
2023-12-21 17:18:21.06 Backup Log was backed up. Database: DB_C, creation date(time): 2023/12/11(11:14:29), first LSN: 496889:160624:1, last LSN: 496889:160664:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'/var/opt/mssql/data/pitr/tlog/ca90d11b-b37f-47fd-9655-433bedde312c_DB_C/ca90d11b-b37f-47fd-9655-433bedde312c_DB_C_2023-12-21-23-18-21.006.log'}). This is an informational message only. No user action is required.
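For what it's worth, the LSNs in those two messages appear to line up the way an intact chain should: the log backup's first LSN (496889:160624:1) comes after the full backup's last LSN (496889:160600:1). LSNs compare like three-part integer tuples, so a quick sanity check is easy to script:

```python
def parse_lsn(lsn: str) -> tuple:
    """Split a SQL Server LSN like '496889:160600:1' into comparable ints."""
    return tuple(int(part) for part in lsn.split(":"))

# LSNs taken from the two backup messages above.
full_backup_last_lsn = parse_lsn("496889:160600:1")
log_backup_first_lsn = parse_lsn("496889:160624:1")

# Tuples compare lexicographically, which matches LSN ordering:
# the log backup picks up after the point the full backup covers.
print(log_backup_first_lsn > full_backup_last_lsn)  # True
```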
So, did cloning "created_from_og_backup" create a clone with the correct data available?
Yup. That worked.
If taking an on-demand backup of "created_from_og_backup" and then cloning resolved all my issues, would the same work if done on "og"?
Yup. No failure due to a corrupted transaction log.
What does this all mean?
What the above investigation implies to me is that PITR is applying all transaction logs since the last full backup. Assuming that's true, there are a couple of problems:
PITR SQL Server docs: https://cloud.google.com/sql/docs/sqlserver/backup-recovery/pitr
Based on what you've observed, it appears you may have uncovered a potential issue with how Cloud SQL manages its Point-in-Time Recovery (PITR), especially when it encounters corrupted transaction logs.
Moving forward, it's important to keep in touch with Google Cloud Support. Share with them the specific symptoms you observed, the exact steps to reproduce the issue, and the log excerpts that highlight the problem. That level of detail is essential for the support team to fully grasp and effectively tackle the potential bug.
edit: responded to incorrect thread. See message here: https://www.googlecloudcommunity.com/gc/Databases/Cloud-SQL-SQL-Server-Clone-Failure-FCB-Open-Failed...