Cloud SQL SQL Server Clone Failure (FCB::Open Failed: Could not open file)

When trying to clone a SQL Server instance containing 3 databases, an error reporting that the .mdf and .ldf files cannot be found is logged for 2 of the 3 databases. The failure to locate these files causes the clone to fail (even though 1 of the 3 databases appears to clone successfully).

Below is an excerpt from the logs. I've searched online for a solution, but everything I found involves modifying permissions on the folder containing the .mdf and .ldf files (which I can't do).

Hoping to solicit some ideas or (hopefully!) some solutions to this issue. Not having PITR is a fairly significant problem.

EDIT: Looks like this was a red herring and the real issue is how corrupted transaction logs affect PITR. For details: https://www.googlecloudcommunity.com/gc/Databases/Cloud-SQL-SQL-Server-Clone-Failure-FCB-Open-Failed...

 

{
  "textPayload": "\r2023-12-20 17:48:36.45 spid32s     FCB::Open failed: Could not open file /var/opt/mssql/data/ef79e9d7-75de-42a2-bd91-a2ee9df47e14_{MY_DB}.mdf for file number 1.  OS error: 2(The system cannot find the file specified.).\r",
  "insertId": "s=49467526bd244e908506623f1a0daed8;i=7cd2;b=5d1a0b44f68b4d9987082780f4b8ac48;m=790a2801;t=60cf99f1f8485;x=1cb654f33a0e15fe-0@ah",
  "resource": {
    "type": "cloudsql_database",
    "labels": {
      "database_id": "{MY_PROJECT}:{CLONED_INSTANCE}",
      "region": "northamerica-northeast1",
      "project_id": "{MY_PROJECT}"
    }
  },
  "timestamp": "2023-12-20T23:48:36.469366Z",
  "severity": "INFO",
  "labels": {
    "_COMM": "dockerd",
    "_GID": "0",
    "_SYSTEMD_UNIT": "docker.service",
    "_BOOT_ID": "5d1a0b44f68b4d9987082780f4b8ac48",
    "_HOSTNAME": "a-1001740856657-regional-vm-ff7e4077e83fbb96-625b",
    "CONTAINER_ID_FULL": "cceb31f52c299697d8e926119521bf7cff2051dab13144bb3496f38a69ac2150",
    "INSTANCE_ID": "8067988355191589818",
    "_TRANSPORT": "journal",
    "_PID": "1016",
    "LOG_BUCKET_NUM": "45",
    "_MACHINE_ID": "de6ef28c724263dd3ed9549364ea8828",
    "_SOURCE_REALTIME_TIMESTAMP": "1703116116469352",
    "_SYSTEMD_INVOCATION_ID": "9fb3844a3be84327a93f2f621d1fd5ec",
    "CONTAINER_ID": "cceb31f52c29",
    "_CAP_EFFECTIVE": "1ffffffffff",
    "INSTANCE_UID": "21-e19b6200-ee00-4b4d-a1f1-763f6de04f4d",
    "_SYSTEMD_SLICE": "system.slice",
    "__REALTIME_TIMESTAMP": "1703116116469366",
    "SYSLOG_IDENTIFIER": "cceb31f52c29",
    "IMAGE_NAME": "gcr.io/speckle-umbrella/sqlserver2019_service:speckle_service_agents_20231029.00_RC02",
    "__MONOTONIC_TIMESTAMP": "2030881266",
    "CONTAINER_NAME": "sql_server",
    "_UID": "0",
    "_CMDLINE": "/usr/bin/dockerd --registry-mirror=https://mirror.gcr.io --host=fd:// --containerd=/var/run/containerd/containerd.sock",
    "_SYSTEMD_CGROUP": "/system.slice/docker.service",
    "CONTAINER_TAG": "cceb31f52c29",
    "_EXE": "/usr/bin/dockerd"
  },
  "logName": "projects/{MY_PROJECT}/logs/cloudsql.googleapis.com%2Fsqlserver.err",
  "receiveTimestamp": "2023-12-20T23:48:37.654310637Z"
}

 

 


The error message indicates that the .mdf and .ldf files for 2 of the databases cannot be found, which is preventing the clone from completing successfully.

Here are some potential causes and suggestions for this issue:

  1. Incorrect Source Database Path:
    • Verify the file paths in the SQL Server configuration on the source instance. Although these paths are managed by the system in Google Cloud SQL, it's worth confirming that they point to existing database files (see the sketch after this list).
  2. Permissions Issue:
    • Direct access to file permissions is not available in Google Cloud SQL. However, ensure that the Cloud SQL service account is configured with the roles and permissions it needs to access and read the source database files.
  3. Corrupted Database Files:
    • Run DBCC CHECKDB on the source databases to check for corruption (also shown after this list). If corruption is detected, consider restoring the databases from a backup. If that is not feasible, contacting Google Cloud Support might be necessary.
  4. Incomplete Replication:
    • If using Point-in-Time Recovery (PITR) for the clone operation, ensure that replication is complete and fully synchronized before initiating the clone. Incomplete replication can result in missing or inaccessible database files.
  5. Google Cloud SQL Limitations and Known Issues:
    • Review Google Cloud SQL's documentation for known limitations or issues related to cloning SQL Server instances. This can provide insight into specific constraints or requirements of the cloning process.
  6. Contact Google Cloud Support:
    • If the issue persists after these checks, I recommend contacting Google Cloud Support. They can offer more in-depth troubleshooting, especially for issues related to the internal workings of the Cloud SQL service.
  7. Logs and Monitoring:
    • Use Google Cloud's logging and monitoring tools to gather detailed information about the clone operation (a sample query also follows this list). This can reveal details that are not apparent from the error message alone.
  8. Alternative Strategies:
    • If cloning continues to fail, consider alternative ways of replicating the databases, such as manual export/import or SQL Server's native backup and restore functionality.
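
For items 1, 3, and 7, here is a minimal sketch of what those checks could look like from a client machine, assuming a connection with the built-in sqlserver user; the instance IP, password, and project/instance names are placeholders:

# Item 1: list the file metadata SQL Server has on record for every database
# (paths under /var/opt/mssql/data are managed by Cloud SQL and are not directly accessible)
sqlcmd -S <INSTANCE_IP> -U sqlserver -P '<PASSWORD>' -Q "SELECT DB_NAME(database_id) AS db, name, physical_name, state_desc FROM sys.master_files;"

# Item 3: run a full integrity check on each user database
sqlcmd -S <INSTANCE_IP> -U sqlserver -P '<PASSWORD>' -Q "DBCC CHECKDB (N'DB_C') WITH NO_INFOMSGS, ALL_ERRORMSGS;"

# Item 7: pull the SQL Server error log for the cloned instance from Cloud Logging
gcloud logging read 'resource.type="cloudsql_database" AND resource.labels.database_id="{MY_PROJECT}:{CLONED_INSTANCE}" AND logName:"sqlserver.err"' --project={MY_PROJECT} --limit=50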

For a more accurate diagnosis, please provide the following information:

  • The complete error message, including any additional details or context.
  • The specific steps you're taking to perform the clone operation.
  • Any relevant configuration details for the source and target Cloud SQL instances.

Thanks for the response. I'm fairly certain the issue I'm facing is a bug with Cloud SQL and have contacted support, but that's been slow moving. Maybe posting in this channel will expedite that process.

For those who don't want to read the exhaustive explanation a little further down: PITR appears to do a restore from the most recent full backup (prior to the chosen point in time) and then re-apply transaction logs up until the desired point in time. If a transaction log is corrupted at any point after the full backup and before the chosen point in time, PITR will fail (a rough illustration of that restore chain follows the list of problems below).

Assuming that's true, there are a couple of problems:

  • The documentation should state very clearly that, for PITR, transaction logs are replayed from the last full backup, which equates to the most recent of [on_demand_backup, oldest_automatic_backup].
  • A corrupted transaction log at any point since the last full backup will cause PITR to fail no matter what time you select. PITR will only work if a healthy full backup exists and the selected time is after that healthy backup. This means if you have a full backup on Monday, a corrupted transaction log on Tuesday, and something terrible happens Wednesday that forces a PITR, you can only choose a time prior to the corrupted transaction log on Tuesday. This, to me, is a huge issue.
  • Restoring a backup to an instance and then cloning to a new instance will give you a completely incorrect instance state unless you run an on-demand backup prior to the PITR.
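
To make the second bullet concrete: on a self-managed SQL Server, a point-in-time restore is a chain of RESTORE statements, and every log backup between the full backup and the STOPAT time must be readable. The sketch below is purely illustrative (Cloud SQL doesn't expose these commands; the host, paths, and times are made up), but it shows why a damaged log backup in the middle of the chain blocks every later point in time:

# Monday's full backup
sqlcmd -S <HOST> -U sa -P '<PASSWORD>' -Q "RESTORE DATABASE DB_C FROM DISK = N'/backups/DB_C_full_mon.bak' WITH NORECOVERY;"

# Tuesday's log backup -- if this file is corrupted, the chain stops here
sqlcmd -S <HOST> -U sa -P '<PASSWORD>' -Q "RESTORE LOG DB_C FROM DISK = N'/backups/DB_C_log_tue.trn' WITH NORECOVERY;"

# Wednesday's log backup, stopping at the desired point in time -- unreachable if Tuesday's log can't be applied
sqlcmd -S <HOST> -U sa -P '<PASSWORD>' -Q "RESTORE LOG DB_C FROM DISK = N'/backups/DB_C_log_wed.trn' WITH STOPAT = '2023-12-20 17:00:00', RECOVERY;"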

Hopefully I get a meaningful response from support, or someone in the forums can point out an issue with my testing and help clarify how Cloud SQL is handling things.

For full details on how I came to this conclusion, read the below.

I have three databases:

  • DB_A
  • DB_B
  • DB_C

When cloning databases individually, DB_A and DB_B succeed (despite the .mdf file not found error) and it's only DB_C that fails.

Reason? The error I originally reported is (probably) a red herring and can probably be ignored (although it's pretty confusing to see it show up in the logs even when the clone is successful).

I dug through some logs and found an error occurred during the restoration of a log in DB_C.

Could not redo log record (496884:85120:3), for transaction ID (0:1129725227), on page (1:9), allocation unit 6488064, database 'DB_C' (database ID 5). Page: LSN = (496884:85120:2), allocation unit = 6488064, type = 13. Log: OpCode = 4, context 20, PrevPageLSN: (496884:85088:6). Restore from a backup of the database, or repair the database.

Given that neither DB_A nor DB_B reports an error of this nature, I'm tempted to think this error is the culprit of the clone failure.

To try and resolve this issue:

  • I ran CHECKDB (which returned no issues) and re-attempted the clone.
  • Clone failed.
  • Tried a clone from a later point in time, so that a different backup would be used (or so I thought).
  • Clone failed.
  • Created a new instance, restored an automated backup from the original instance, then cloned the new instance using PITR (roughly the gcloud sequence sketched after this list).
  • Clone succeeded! But there was no data? That's unusual.
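
For reference, that restore-then-clone sequence looks roughly like this with gcloud (instance names are the ones defined in the list below; the backup ID and timestamp are placeholders):

# Restore an automated backup of the original instance onto the new instance
# (the backup ID comes from: gcloud sql backups list --instance=og)
gcloud sql backups restore <BACKUP_ID> --restore-instance=created_from_og_backup --backup-instance=og

# Clone the restored instance to a specific point in time (RFC 3339 timestamp)
gcloud sql instances clone created_from_og_backup backup_clone --point-in-time='2023-12-21T20:00:00.000Z'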

To keep some names clear:

  • og: The original SQL Server instance. Cloning this failed.
  • created_from_og_backup: Created from a backup of "og"
  • backup_clone: Cloned from "created_from_og_backup"

This is some bizarre behavior. Maybe it's because I chose a PITR time that occurred after "created_from_og_backup" was created, but before the backup from "og" was restored? To test this, I took a look at the backup history and operations history of "created_from_og_backup".
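
(Both histories are visible in the console, and can also be pulled with gcloud; a sketch using the instance name from above:)

# Backups taken for the instance
gcloud sql backups list --instance=created_from_og_backup

# Operations (create, backup, restore, clone, ...) run against the instance
gcloud sql operations list --instance=created_from_og_backup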

backup history

Created                  | Type      | Location         | Description
-------------------------|-----------|------------------|----------------------------------
Dec 21, 2023, 1:32:15 PM | Automated | Multi-region: us | Backup for SQL Server PITR setup
Dec 21, 2023, 1:30:43 PM | Automated | Multi-region: us | Backup for SQL Server PITR setup

 

Operations history

Creation Time            | Completion Time          | Type                          | Status
-------------------------|--------------------------|-------------------------------|---------------------------------------
Dec 21, 2023, 2:08:32 PM | Dec 21, 2023, 2:08:47 PM | Update                        | Update finished
Dec 21, 2023, 1:39:47 PM | Dec 21, 2023, 1:48:42 PM | Restore instance from backup  | Restore instance from backup finished
Dec 21, 2023, 1:32:15 PM | Dec 21, 2023, 1:33:46 PM | Backup                        | Backup finished
Dec 21, 2023, 1:30:43 PM | Dec 21, 2023, 1:32:15 PM | Backup                        | Backup finished
Dec 21, 2023, 1:24:37 PM | Dec 21, 2023, 1:35:37 PM | Create                        | Create finished

 

Ok, so if I do a PITR after "Dec 21, 2023, 1:39:47 PM" it should contain all of the databases and all my data, right?

No! None of my data was available. The only available databases were master, model, tempdb, and msdb. This is very unexpected behavior and is almost certainly a bug.

Ok, ok. What if I take a new, on-demand backup of "created_from_og_backup", wait 10 minutes for some logs to be created and backed up, then try another PITR?

the backup

2023-12-21 17:04:44.10 Backup Database backed up. Database: DB_C, creation date(time): 2023/12/11(11:14:29), pages dumped: 16362819, first LSN: 496889:160560:40, last LSN: 496889:160600:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {'15e8ea9178786b60'}). This is an informational message only. No user action is required.

 

the transaction log backup

2023-12-21 17:18:21.06 Backup Log was backed up. Database: DB_C, creation date(time): 2023/12/11(11:14:29), first LSN: 496889:160624:1, last LSN: 496889:160664:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'/var/opt/mssql/data/pitr/tlog/ca90d11b-b37f-47fd-9655-433bedde312c_DB_C/ca90d11b-b37f-47fd-9655-433bedde312c_DB_C_2023-12-21-23-18-21.006.log'}). This is an informational message only. No user action is required.
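
(Those LSN chains can also be cross-checked from inside the instance via msdb's backup history; a sketch assuming the same sqlcmd connection as earlier and that the login can read msdb:)

sqlcmd -S <INSTANCE_IP> -U sqlserver -P '<PASSWORD>' -Q "SELECT database_name, type, backup_start_date, first_lsn, last_lsn FROM msdb.dbo.backupset WHERE database_name = N'DB_C' ORDER BY backup_start_date DESC;"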

So, did cloning "created_from_og_backup" create a clone with the correct data available?

Yup. That worked.

 

If taking an on-demand backup of "created_from_og_backup" and then cloning resolved all my issues, would the same work if done on "og"?

Yup. No failure due to a corrupted transaction log.
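
(In other words, the workaround that worked in practice, sketched with gcloud; "og-clone" is just a placeholder name for the new clone:)

# Force a fresh full backup on the original instance
gcloud sql backups create --instance=og --description="pre-clone full backup"

# After waiting for a transaction log backup to land, clone to the desired point in time
gcloud sql instances clone og og-clone --point-in-time='2023-12-21T23:30:00.000Z'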

 

What does this all mean?

What the above investigation implies to me is that PITR applies all transaction logs since the last full backup. Assuming that's true, there are a couple of problems:

  • The documentation should state very clearly that, for PITR, transaction logs are replayed from the last full backup prior to the chosen point in time.
  • A corrupted transaction log at any point since the last full backup prior to the selected point in time will cause PITR to fail no matter what time you select. PITR will only work if a healthy full backup exists and the selected time is after that healthy backup. This means if you have a full backup on Monday, a corrupted transaction log on Tuesday, and something terrible happens Wednesday that forces a PITR, you can only choose a time prior to the corrupted transaction log on Tuesday. This, to me, is a huge issue.
  • Restoring a backup to an instance and then cloning to a new instance will give you a completely incorrect instance state unless you run an on-demand backup prior to the PITR.

PITR SQL Server docs: https://cloud.google.com/sql/docs/sqlserver/backup-recovery/pitr

Based on what you've observed, it appears you may have uncovered a potential issue with how Cloud SQL manages its Point-in-Time Recovery (PITR), especially when it encounters corrupted transaction logs.

Moving forward, it's important to keep in touch with Google Cloud Support. Share with them the specific details of what you've noticed, the steps you took that can be repeated to see the same issue, and exact excerpts from the logs that highlight the problem. This detailed information is essential for the support team to fully grasp and effectively tackle the potential bug.