Hi there!
We have been investigating a problem for days now and haven't found the cause or a workaround.
On our dev and prod GKE clusters, the cloudsql-proxy sidecar containers hit a connection timeout every 1-2 hours, triggered by a DB health check from our Django 3.2 application (running with uvicorn workers). We are afraid the same failure can occur during normal operation as well.
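For illustration, a Django DB health check of this sort typically looks something like the following (a generic sketch, not our exact code; names are illustrative):

```python
# Generic sketch of a Django DB health check view (illustrative names).
from django.db import connection
from django.http import HttpResponse, HttpResponseServerError

def healthz(request):
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")  # round-trips through cloudsql-proxy to Postgres
            cursor.fetchone()
    except Exception:
        return HttpResponseServerError("db unreachable")
    return HttpResponse("ok")
```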
Very similar issues show up on the web (Stack Overflow, Google Groups, this forum), but the suggested fixes are tied to those particular environments and didn't work in our case; the problem seems pretty hard to debug.
The cloudsql-proxy container logs this error:
```
couldn't connect to "db-development:europe-west4:app-db": dial tcp xxx.xxx.xxx.xxx:3307: connect: connection timed out
```
The timestamp of this error matches the one on the app container:
```
django.db.utils.OperationalError: connection to server at "cloudsql-proxy" (10.97.23.253), port 5432 failed: server closed the connection unexpectedly
    This probably means the server terminated abnormally before or while processing the request.
```
On the dev cluster, just to investigate this issue, we updated cloudsql-proxy to the latest version (1.32.0) and upgraded PostgreSQL from 13 to 14. The errors still pop up at the same frequency.
What we have checked so far:
1) The errors are not related to misconfiguration (the app works just fine 99.99% of the time) or to autoscaling (the timestamps in the cluster autoscaler logs don't match the error timestamps).
2) The PostgreSQL logs show that connections are opened and closed correctly.
3) Connections never reach 1% of the max_connections flag.
4) We use a standard Django DB configuration (persistent connections rather than an external connection pooler); the relevant settings are sketched below.
CONN_MAX_AGE is 60, i.e. a persistent connection can be reused for up to 60 seconds before Django closes it.
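For reference, the DB section of settings.py looks roughly like this (a simplified sketch; hostnames and credentials are illustrative):

```python
import os

# settings.py (simplified sketch; real credentials come from the environment)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        "USER": "app",
        "PASSWORD": os.environ["DB_PASSWORD"],
        "HOST": "cloudsql-proxy",   # the proxy listens on 5432 and dials Cloud SQL on 3307
        "PORT": "5432",
        "CONN_MAX_AGE": 60,         # reuse a connection for up to 60s between requests
    }
}
```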
Does anyone have any hints to help me debug or fix this annoying issue?
Connection timeouts are tricky to diagnose because they're often related to the network topology. Nonetheless, this is unusual behavior.
I see that you're running the Cloud SQL Proxy as a sidecar, but then in the Django logs I see this: connection to server at "cloudsql-proxy" (10.97.23.253), which makes me think you're running the proxy as a service. Would you mind elaborating on how your app connects to the proxy?
Also, have you tried configuring any health checks? A readiness probe might help this situation: https://github.com/GoogleCloudPlatform/cloud-sql-proxy/tree/v1/examples/k8s-health-check
Finally, I suspect this isn't an issue with the proxy itself, but you might try our new v2 to see if that helps.
Thank you @enocom for your answer!
Sorry about the confusion: the cloudsql proxy runs in its own pod. The app connects to it by using the container name.
I would like to try v2, but I can't find the right Docker image tag for it (using the 2.0.0-preview.1 tag from the releases page doesn't work: https://github.com/GoogleCloudPlatform/cloud-sql-proxy/releases).
BTW, since yesterday we have a hint that the issue happens when there is no traffic, so it might be an idle-timeout setting in PostgreSQL conflicting with the CONN_MAX_AGE parameter in Django. But I wasn't able to find such a setting on the DB side, and on the client side we don't set any timeout.
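For reference, if dropped idle connections turn out to be the culprit, these are the kind of client-side knobs that exist (a sketch of libpq keepalive/timeout options passed through Django's OPTIONS via psycopg2; we don't set any of these today, and the values are just examples):

```python
# Hypothetical client-side keepalive/timeout settings (libpq parameters
# passed to psycopg2 through Django's OPTIONS). Not part of our current config.
DATABASES["default"]["OPTIONS"] = {
    "connect_timeout": 10,      # fail fast instead of hanging on dial
    "keepalives": 1,            # enable TCP keepalives on idle connections
    "keepalives_idle": 30,      # seconds of idle before the first keepalive probe
    "keepalives_interval": 10,  # seconds between keepalive probes
    "keepalives_count": 3,      # drop the connection after 3 missed probes
}
```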
Thanks for the tips nonetheless. I will also try to add some probes and see what happens.
Here's the correct command:
```
docker pull gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.0.0-preview.1
```
I would also suggest running the proxy as a sidecar. That might help with the connection timeouts, and it's also more secure: with the proxy in a separate pod, the app's database traffic has to cross the node network, since the two pods aren't necessarily scheduled onto the same VM.