
Why do I have 1.4 TiB of archived WAL logs on the replica instance, and how can we clean it up?

[Screenshot attached: Screenshot from 2025-01-03 16-24-10.png]

My current Cloud SQL PostgreSQL instance has a single read replica. Even though the table data is only around a few GiB, the WAL is in the TiB range, as we can see in the screenshot.

I want to know why my replica instance has archived WAL stuck on disk, whether it is safe to remove it, and how.


In Cloud SQL PostgreSQL, excessive archived WAL (Write-Ahead Log) file accumulation on read replicas typically points to replication lag, though point-in-time recovery (PITR) retention can also contribute. Under normal conditions, Cloud SQL automatically removes WAL files once they are no longer required for replication or recovery. However, significant lag can prevent this cleanup. Common causes of replication lag include network latency, insufficient replica resources, long-running queries on the replica that block the apply process, or stalled replay processes. A long PITR retention window can also increase the amount of WAL retained, as Cloud SQL must preserve logs for recovery within the retention period.

To address this issue, start by verifying replication health using the Cloud Console or gcloud CLI to identify lag or errors. Resolve replication issues by increasing replica resources, optimizing replica-side queries that might block the WAL apply process, and ensuring stable network connectivity. While improving primary instance performance—particularly addressing long-running or "idle in transaction" queries using tools like pg_stat_activity—is generally helpful, the direct impact on WAL accumulation primarily stems from replica-side optimizations.
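
For example, a quick way to run both checks from a workstation might look like the following sketch; the instance name and connection details are placeholders, and it assumes you can reach the replica (for example through the Cloud SQL Auth Proxy):

    # Replica state and configuration as Cloud SQL reports it
    gcloud sql instances describe my-replica

    # Sessions with open transactions on the replica, longest first
    # (includes "idle in transaction" sessions that can block WAL replay)
    psql "host=REPLICA_IP dbname=postgres user=postgres" -c "
        SELECT pid, state, now() - xact_start AS xact_age, query
        FROM pg_stat_activity
        WHERE xact_start IS NOT NULL
        ORDER BY xact_age DESC;"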

If PITR is enabled, consider reducing the retention window, balancing storage reduction with recovery needs. Avoid removing the replica unless it is no longer required, and only recreate it if replication is irreparably broken, as indicated in error logs. Investigate long-running or idle transactions on the primary database, as they can generate substantial WAL, overwhelming the replica and delaying replay.
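
As a rough sketch of checking and reducing that retention (the instance name is a placeholder; double-check the current flag and field names in the gcloud reference before running anything):

    # Current transaction log (WAL) retention backing PITR
    gcloud sql instances describe my-primary \
        --format="value(settings.backupConfiguration.transactionLogRetentionDays)"

    # Shorten the retention window (trade-off: less WAL retained vs. a shorter PITR window)
    gcloud sql instances patch my-primary --retained-transaction-log-days=3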

Importantly, never manually delete WAL files. Doing so can severely disrupt replication and recovery processes. Cloud SQL automatically manages WAL cleanup once the files are no longer needed. By addressing these underlying issues, you can ensure efficient WAL management, avoid storage bloat, and maintain the health of your replicas.

Hi @ms4446 
[Screenshot attached: Screenshot from 2025-01-06 11-55-15.png]
I did check the replication lag; it looks okay.

 

We used to have WAL accumulation on the primary DB as well, but after we switched the PITR storage location from disk to Cloud Storage, the issue on the primary went away. The replica, however, still has this issue.

Querying pg_stat_activity also didn't return any stuck transactions.


Thank you

In Cloud SQL, excessive archived WAL accumulation on a replica typically points to replication lag, even after similar issues on the primary instance were resolved by moving PITR storage to Cloud Storage. This suggests that the replica is unable to process WAL changes fast enough, leading to a backlog. While pg_stat_activity indicates no current long-running or idle transactions, past intermittent issues may still have contributed. The replication lag spikes seen in the provided chart suggest periodic delays caused by resource limitations, network latency, or large transactions.
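
One replica-side check that quantifies such a backlog is comparing the last WAL position received with the last one replayed; the connection details below are placeholders, and the query assumes PostgreSQL 10 or later:

    # Run against the replica
    psql "host=REPLICA_IP dbname=postgres user=postgres" -c "
        SELECT pg_is_in_recovery()                       AS in_recovery,
               pg_last_wal_receive_lsn()                 AS received_lsn,
               pg_last_wal_replay_lsn()                  AS replayed_lsn,
               pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                               pg_last_wal_replay_lsn()) AS replay_backlog_bytes,
               now() - pg_last_xact_replay_timestamp()   AS replay_delay;"

A consistently large replay_backlog_bytes, despite a healthy network, points at the replica's apply side rather than the link to the primary.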

To address this, first investigate potential resource constraints on the replica. Monitor CPU and memory usage to ensure they are not nearing capacity. High disk queue depth, read/write latency, or an underperforming disk type (especially if it is lower-performing than the primary) can also slow down WAL application. If resource constraints are evident, scaling the replica to a larger instance tier with better CPU, memory, and disk performance can often resolve the issue.

Network latency is another critical factor. Use Cloud Monitoring to assess latency between the primary and replica, and consider moving them to the same region or zone if they are geographically distant. Additionally, review replication metrics such as replay_lag, write_lag, and flush_lag in pg_stat_replication to pinpoint bottlenecks in WAL processing. Analyzing historical logs for errors related to replication, resource exhaustion, or slow queries can also help identify patterns and root causes.
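
For reference, a minimal query for those lag columns, run against the primary (connection details are placeholders; the *_lag columns exist on PostgreSQL 10 and later):

    # Run against the primary
    psql "host=PRIMARY_IP dbname=postgres user=postgres" -c "
        SELECT application_name, client_addr, state,
               write_lag, flush_lag, replay_lag,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_gap_bytes
        FROM pg_stat_replication;"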

If replication lag persists despite resource scaling and network optimization, review replica configuration settings like max_standby_streaming_delay and max_standby_archive_delay. While these settings typically work well at their defaults, unnecessarily high values may prolong delays. As a last resort, if replication becomes irreparably stuck, recreating the replica from a recent backup may provide a clean slate and eliminate accumulated WAL.
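
As a sketch, the current values can be read on the replica with a plain query; changing them in Cloud SQL would go through database flags, which is hedged here because support for these flags should be verified against the supported-flags list, and changing flags may restart the replica:

    # Current standby delay settings on the replica (connection details are placeholders)
    psql "host=REPLICA_IP dbname=postgres user=postgres" -c "
        SELECT name, setting, unit
        FROM pg_settings
        WHERE name IN ('max_standby_streaming_delay', 'max_standby_archive_delay');"

    # If these flags are supported for your version, they would be set roughly like this.
    # Note: --database-flags replaces the instance's full flag set, so include any existing flags too.
    gcloud sql instances patch my-replica \
        --database-flags=max_standby_streaming_delay=30000,max_standby_archive_delay=30000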

Thanks @ms4446 for your response.

I did increase the CPU and memory on the replica, but the accumulated WAL still persists.

I'm going to monitor the configs and DB transaction behaviour for a bit to see if anything stands out.