Hello,
I've been trying to debug a constantly failing archive_command and get more information about it, but I haven't been able to get any further.
The errors and warnings started unexpectedly; we made no changes to the database. The latest maintenance ran two days before the problem started.
The errors in the logs are (repeated, maybe hundreds of times per day):
INFO 2025-03-05T18:15:50.174684Z 2025-03-05 18:15:49.681 UTC [34]: [1-1] db=,user= LOG: archive command failed with exit code 1
(....)
INFO The failed archive command was:
/utils/replication_log_processor
-disable_log_to_disk
-action=archive
-file_name=00000001000000E900000045
-local_file_path=pg_wal/00000001000000E900000045
-async_archive
-remote_storage
(....)
WARNING archiving write-ahead log file "00000001000000E900000045" failed too many times, will try again later
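For anyone checking the same symptom from inside the database: the archiver's failure counters are exposed in the standard pg_stat_archiver view. A minimal sketch (the host and user below are placeholders for your own connection details):

# pg_stat_archiver is standard PostgreSQL; last_failed_wal should match
# the file name from the log lines above.
psql -h <instance-ip> -U postgres -d postgres -c \
  "SELECT archived_count, failed_count, last_failed_wal, last_failed_time
     FROM pg_stat_archiver;"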
Metrics during the error period are unreliable as well (the period marked in pink in the screenshot):
Queries were slow or failing during that period.
Instance configs:
availableMaintenanceVersions:
- POSTGRES_16_8.R20250112.01_14
backendType: SECOND_GEN
createTime: '2024-08-11T00:20:25.545Z'
databaseInstalledVersion: POSTGRES_16_6
databaseVersion: POSTGRES_16
gceZone: southamerica-east1-b
geminiConfig:
  activeQueryEnabled: false
  entitled: false
  googleVacuumMgmtEnabled: false
  indexAdvisorEnabled: false
  oomSessionCancelEnabled: false
includeReplicasForMajorVersionUpgrade: false
instanceType: CLOUD_SQL_INSTANCE
kind: sql#instance
maintenanceVersion: POSTGRES_16_6.R20250112.01_03
region: southamerica-east1
satisfiesPzi: false
settings:
  activationPolicy: ALWAYS
  availabilityType: ZONAL
  backupConfiguration:
    backupRetentionSettings:
      retainedBackups: 7
      retentionUnit: COUNT
    enabled: true
    kind: sql#backupConfiguration
    location: us
    pointInTimeRecoveryEnabled: true
    replicationLogArchivingEnabled: true
    startTime: 02:00
    transactionLogRetentionDays: 7
    transactionalLogStorageState: CLOUD_STORAGE
  connectorEnforcement: NOT_REQUIRED
  dataDiskSizeGb: '10'
  dataDiskType: PD_SSD
  databaseFlags:
  - name: cloudsql.logical_decoding
    value: on
  deletionProtectionEnabled: true
  edition: ENTERPRISE
  insightsConfig:
    queryInsightsEnabled: true
    queryPlansPerMinute: 5
    queryStringLength: 1024
  kind: sql#settings
  locationPreference:
    kind: sql#locationPreference
    zone: southamerica-east1-b
  maintenanceWindow:
    day: 7
    hour: 3
    kind: sql#maintenanceWindow
    updateTrack: stable
  pricingPlan: PER_USE
  replicationLagMaxSeconds: 31536000
  replicationType: SYNCHRONOUS
  settingsVersion: '61'
  storageAutoResize: true
  storageAutoResizeLimit: '0'
  tier: db-f1-micro
  userLabels:
    env: prod
sqlNetworkArchitecture: NEW_NETWORK_ARCHITECTURE
state: RUNNABLE
upgradableDatabaseVersions:
- displayName: PostgreSQL 17
  majorVersion: POSTGRES_17
  name: POSTGRES_17
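(For reference, a dump like the one above can be pulled with gcloud; a sketch, assuming gcloud access, with my-instance as a placeholder name:)

# Prints the full instance configuration as YAML.
gcloud sql instances describe my-instance --format=yaml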
And the output of these commands:
SHOW archive_mode;
> archive_mode: always
SHOW archive_command;
> /utils/replication_log_processor -disable_log_to_disk -action=archive -file_name=%f -local_file_path=%p -async_archive -remote_storage
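To see how far the live WAL position had run ahead of the stuck segment, the current WAL file name can be queried as well (standard PostgreSQL functions; connection details are placeholders):

# Compare the result with the stuck segment 00000001000000E900000045
# from the logs to gauge the archiving backlog.
psql -h <instance-ip> -U postgres -d postgres -c \
  "SELECT pg_walfile_name(pg_current_wal_lsn());"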
Fixed by disabling point-in-time recovery (PITR) and restarting the instance. I will re-enable PITR today, but I would like to understand whether there is any issue I should account for before enabling it again.
Update:
I got it working without errors after decreasing the PITR log retention period.
1 - Restarted the instance: same error, no change.
2 - Disabled PITR: problem solved, the instance works fine (but this is not an option for us, since our data is constantly updated).
3 - Re-enabled PITR (7-day retention): the problem came back.
4 - Decreased the PITR retention window to 2 days: no errors (the gcloud equivalent is sketched below).
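For reference, the retention change in step 4 can also be made with gcloud (my-instance is a placeholder name):

# Shrinks the transaction log retention window used for PITR from 7 to 2 days.
gcloud sql instances patch my-instance \
  --retained-transaction-log-days=2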
My instance is far from struggling; usage is within safe boundaries, with CPU and memory utilization below 20-25%.
Hopefully someone can help.
Hi @Mognu,
Welcome to Google Cloud Community!
Your issue is likely related to PITR logs consuming space on the instance's data disk. If your instance enabled PITR before Google Cloud Storage (GCS) support was introduced, WAL files are stored on the disk instead of in GCS, which increases storage usage. Since WAL segments are archived automatically once they fill (16 MB by default), a longer retention period (like 7 days) can lead to significant disk growth.
Since reducing the retention period to 2 days resolved the issue, it's likely that storage constraints were affecting the archiving process. You can check the instance's detailed disk usage in the Cloud Console's monitoring charts; if archived WAL logs take up a large portion of the data disk, that confirms the cause.
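A minimal server-side check, assuming your database user has (or can be granted) pg_monitor membership; connection details are placeholders:

# Counts the WAL segments sitting in pg_wal and their total size.
# pg_ls_waldir() is standard PostgreSQL (restricted to pg_monitor by default).
psql -h <instance-ip> -U postgres -d postgres -c \
  "SELECT count(*) AS wal_files, pg_size_pretty(sum(size)) AS total_size
     FROM pg_ls_waldir();"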
To manage disk usage, you could consider:
- Keeping the transaction log retention window short (as you already did by going from 7 days to 2).
- Increasing the data disk size, or moving off the db-f1-micro tier so the instance has more headroom.
If the issue persists or you need help migrating PITR logs to GCS, reaching out to Google Cloud Support for further analysis would be a good next step.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Hello,
We have had the same issue on a bunch of micro instances (it has not happened on our larger instances), and I have been in talks with Google Support to dig into the root cause.
The conclusion so far is that they had been updating the Query Insights agents for Cloud SQL instances during the same time period. Since the issue persisted, they recommended that we either disable Query Insights or upgrade the instances to give them more resources.
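If it helps anyone else: assuming the instances are managed with gcloud (my-instance is a placeholder, and it's worth double-checking the flag against your gcloud version), the Query Insights toggle they referred to should be a single patch:

# Turns Query Insights off for the instance.
gcloud sql instances patch my-instance \
  --no-insights-config-query-insights-enabled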