Point-in-time recovery failing (archive_command)

Mognu · 03-07-2025 04:26 AM

Hello,

I've been trying to debug to get more information about constantly failing archive_command. But I am unable to go further.

The errors and warnings started unexpectedly as, no changes in database were made by us. Latest maintenance was 2 days before the start of the problem.

Errors in logs are (repeatedly, maybe hundreds of times per day)

INFO 2025-03-05T18:15:50.174684Z 2025-03-05 18:15:49.681 UTC [34]: [1-1] db=,user= LOG: archive command failed with exit code 1
(....)
INFO The failed archive command was: 
  /utils/replication_log_processor 
    -disable_log_to_disk 
    -action=archive 
    -file_name=00000001000000E900000045 
    -local_file_path=pg_wal/00000001000000E900000045 
    -async_archive 
    -remote_storage
(....)
WARNING archiving write-ahead log file "00000001000000E900000045" failed too many times, will try again later

Metrics during the error period, are unreliable as well: (PINK marked period)

Queries were slow or failing during that period.

Instance configs:

availableMaintenanceVersions:
- POSTGRES_16_8.R20250112.01_14
backendType: SECOND_GEN
createTime: '2024-08-11T00:20:25.545Z'
databaseInstalledVersion: POSTGRES_16_6
databaseVersion: POSTGRES_16
gceZone: southamerica-east1-b
geminiConfig:
  activeQueryEnabled: false
  entitled: false
  googleVacuumMgmtEnabled: false
  indexAdvisorEnabled: false
  oomSessionCancelEnabled: false
includeReplicasForMajorVersionUpgrade: false
instanceType: CLOUD_SQL_INSTANCE
kind: sql#instance
maintenanceVersion: POSTGRES_16_6.R20250112.01_03
region: southamerica-east1
satisfiesPzi: false
settings:
  activationPolicy: ALWAYS
  availabilityType: ZONAL
  backupConfiguration:
    backupRetentionSettings:
      retainedBackups: 7
      retentionUnit: COUNT
    enabled: true
    kind: sql#backupConfiguration
    location: us
    pointInTimeRecoveryEnabled: true
    replicationLogArchivingEnabled: true
    startTime: 02:00
    transactionLogRetentionDays: 7
    transactionalLogStorageState: CLOUD_STORAGE
  connectorEnforcement: NOT_REQUIRED
  dataDiskSizeGb: '10'
  dataDiskType: PD_SSD
  databaseFlags:
  - name: cloudsql.logical_decoding
    value: on
  deletionProtectionEnabled: true
  edition: ENTERPRISE
  insightsConfig:
    queryInsightsEnabled: true
    queryPlansPerMinute: 5
    queryStringLength: 1024
  kind: sql#settings
  locationPreference:
    kind: sql#locationPreference
    zone: southamerica-east1-b
  maintenanceWindow:
    day: 7
    hour: 3
    kind: sql#maintenanceWindow
    updateTrack: stable
  pricingPlan: PER_USE
  replicationLagMaxSeconds: 31536000
  replicationType: SYNCHRONOUS
  settingsVersion: '61'
  storageAutoResize: true
  storageAutoResizeLimit: '0'
  tier: db-f1-micro
  userLabels:
    env: prod
sqlNetworkArchitecture: NEW_NETWORK_ARCHITECTURE
state: RUNNABLE
upgradableDatabaseVersions:
- displayName: PostgreSQL 17
  majorVersion: POSTGRES_17
  name: POSTGRES_17

And the output of these commands:

SHOW archive_mode;
> archive_mode: always

SHOW archive_command;
> /utils/replication_log_processor -disable_log_to_disk -action=archive -file_name=%f -local_file_path=%p -async_archive -remote_storage

Fixed by disabling Point-in-time recovery and restarting the instance. I will enable PITR today again, but I would like to understand if there is any issue I must account for, before enabling it again.

Mognu

Update:
I got it working without error after decreasing PITR log retention days.

1 - Restarted Instance: Same error, no changes
2 - Disabled PITR: Problem is solved, instance works fine (but not an option because data is constantly updated)
3 - Enabled PITR (7 days retention): Problem is back.
4 - Decreased PITR retention window, to 2 days: No errors.

My instance is far from struggling, usage is within safe boundaries... CPU and Memory lower than 25-20% usage.

Hopefully someone can help.

mcbsalceda

Hi @Mognu,

Welcome to Google Cloud Community!

Your issue is likely related to PITR logs consuming space on the instance’s data disk. If your instance enabled PITR before Google Cloud Storage (GCS) support was introduced, WAL files are stored on the disk instead of GCS, leading to increased storage usage. Since WAL archiving happens automatically when a file reaches 16MB, a longer retention period (like 7 days) can cause significant disk growth.

Since reducing the retention period to 2 days resolved the issue, it’s likely that storage constraints were affecting the archiving process. You can check detailed disk usage on the Viceroy page—if archived WAL logs take up a large portion, that confirms the cause.

To manage disk usage, you could consider:

Switching PITR logs to GCS if your instance supports it, reducing disk usage.
Taking hourly backups instead, which are stored in GCS and cheaper, though you’d lose second-level restore capability.

If the issue persists or you need help migrating PITR logs to GCS, reaching out to Google Cloud Support for further analysis would be a good next step.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

nimh

Hello,

So we have had the same issues for bunch of micro instances (it has not happended for our larger instances), and I have been in talks with Google support to dig deeper to the root cause.

What the conclusion has been so far is that, they had been updating the query insights agents for the Cloud SQL instans in the same time period. They said issue persisted they recommended us to disable query insight or upgrade the instances to have more ressources.

bob5587205

Just hit the same issue a day ago (test environment, minimal instance, very light load): slow responses (feels laggy in the SQL studio, and the actual service lags), the metrics (CPU utililization, and others) started showing irregular gaps (missing data). Then, traced it to logs, showing that `replication_log_processor` is failing.