Re: Intermittent connection errors from Cloud Run ...

KoenDeWit · 01-25-2024 04:22 AM

Our webserver (a Django app on Gunicorn running on Google Cloud Run) connects to a Postgres 15 database (on Google Cloud SQL) through Psycopg. Most queries succeed, but recently ~1% of the queries fail at random moments, with varying error messages like:

connection failed: region:db_name/.s.PGSQL.5432" failed: server closed the connection unexpectedly - This probably means the server terminated abnormally before or while processing the request.
got message type "P", length 1380524074
connection failed: region:db_name/.s.PGSQL.5432" failed: FATAL: password authentication failed for user "postgres"
consuming input failed: server closed the connection unexpectedly - This probably means the server terminated abnormally before or while processing the request.
invalid socket

Sometimes we see an error on the server side at the same moment, for example:

FATAL: canceling authentication due to timeout
FATAL: connection to client lost
FATAL: password authentication failed for user

The password authentication failed error puzzles me: we're always connecting with the same password.

The 'got message type "P"' error looks cryptic to me, and the length mentioned (over 1G!) is abnormal. I don't see why such a long message would be sent.
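
For what it's worth, the "length" in that message can be decoded by hand. A PostgreSQL protocol message starts with a one-byte type followed by a four-byte big-endian length, so an absurd length often just means the client read bytes that were never a PostgreSQL message. The sketch below assumes the reported number is that raw length field and reassembles the five header bytes; the helper name is made up for illustration.

```python
import struct

def header_bytes(msg_type: str, length: int) -> bytes:
    """Rebuild the 5 header bytes: 1-byte type + 4-byte big-endian length."""
    return msg_type.encode("ascii") + struct.pack("!I", length)

header = header_bytes("P", 1380524074)
print(header)  # b'PRI *'

# The HTTP/2 client connection preface starts with the same five bytes.
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
print(HTTP2_PREFACE.startswith(header))  # True
```

The bytes spell "PRI *", which is how an HTTP/2 client connection preface begins. That is only consistent with non-PostgreSQL traffic arriving on the socket; it does not show where those bytes came from.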

In the Django settings file, I tried different settings:

CONN_HEALTH_CHECKS = True or False
CONN_MAX_AGE = 0 (new connection for every request) or None (unlimited persistent connections)
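
For reference, both options are per-database keys inside DATABASES in Django 4.1 and later. A minimal sketch of where they go, assuming a Cloud SQL Unix socket directory under /cloudsql/; the instance path, database name, and environment variable names are placeholders:

```python
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        # Placeholder socket directory; libpq appends /.s.PGSQL.5432 to it.
        "HOST": "/cloudsql/my-project:my-region:my-instance",
        "NAME": os.environ.get("DB_NAME", "db_name"),
        "USER": os.environ.get("DB_USER", "postgres"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        # 0 closes the connection at the end of each request;
        # None keeps it open indefinitely.
        "CONN_MAX_AGE": 0,
        # When True, a persistent connection is checked before it is
        # reused for a request; it has no effect with CONN_MAX_AGE = 0.
        "CONN_HEALTH_CHECKS": False,
    }
}
```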

Resources (CPU, memory, disk space, ...) are well below the limits.

PostgreSQL is at version 15, Django at version 4.2.9 and psycopg at version 3.1.17.

We tried reverting psycopg to 3.1.14 and Django to 4.2.8, since we had been running for several weeks without problems before, but the connection issues are still present.

Does anyone have any ideas on how I can investigate this problem?

Toon

I have exactly the same issue as you. It also started recently (~2 weeks ago). We did not have any problems before. (django is 4.2.8, postgres is at 14, psycopg 3.1.16)

Toon

I downgraded the psycopg version back from 3.1.12 to 3.1.8. This seems to solve it for us. It needs a bit more run time before I can consider it solved.

KoenDeWit

Hi Toon,

Thanks for letting us know! We'll try 3.1.8 if that version solves the intermittent connection errors for you.

Just curious: do you see the same 5 error messages on the Django side and the same 3 error messages on the PostgreSQL side?

Toon

I don't see any errors on the PostgreSQL side. Perhaps that's due to a configuration issue.

On the Django side I see the same error messages as you; 95% are "server closed the connection unexpectedly".

I also have this one:

connection failed: <instance>/.s.PGSQL.5432" failed: expected authentication request from server, but received P

KoenDeWit

Thanks Toon! We have an ongoing Google Cloud Support Case, I will let you know if we find a solution to the intermittent connection errors.

mike-cloverleaf

Just to chime in here. My company has also been experiencing this more frequently. We use Cloud Run instances in the us-central1 region, and we use the Cloud SQL Connector in cloud run to connect to Cloud SQL PostgreSQL 13/14 instances also located in us-central1 locations.

We have a bash script that takes data files from Cloud Storage and copies them into Postgres using psql commands (we do this to refresh data quickly on non-prod Cloud SQL instances), connecting through the Unix socket available on Cloud Run. Yesterday we were hitting invalid socket errors constantly, and we saw them quite frequently last month, whereas in prior months we would only see them occasionally. The script itself works fine, since we reference the Unix socket correctly. The process was failing so often on Cloud Run yesterday that I ended up running it locally through my local Cloud SQL proxy, and it ran without issue the first time. It seems the Unix socket is randomly dropping connections on Cloud Run instances for an unknown reason.

KoenDeWit

Hi Mike,

Thanks for letting us know! A few questions, to get a better picture of the problem:

Do you use Psycopg, or another driver for PostgreSQL? Do you use Django, SQLAlchemy, or any other ORM or framework? Do you use connection pooling?

Do you only see "invalid socket" errors, or also other errors on the Cloud Run side? And do you see any errors on the database side?

mike-cloverleaf

Hi Koen,

The process that refreshes our non-prod cloud sql instances actually runs under Rails which uses ActiveRecord as the ORM. We provide config in Rails to connect to cloud sql database host (unix socket), using the postgresql adapter and specifying pool count as 5. For this we mainly get invalid sockets with the occasional: FATAL: connection to client lost. I do have a cloud sql error every time we get invalid sockets:

FATAL:  terminating connection due to administrator command

On another of our Cloud Run services we use SQLAlchemy with FastAPI. We use a connection pool to connect to Cloud SQL, using Connector from google.cloud.sql.connector. For this API we were seeing enough connection errors that I am now catching all SQLAlchemyError exceptions. All errors here were output from Cloud Run. We were experiencing the following error back in October:

 sqlalchemy.exc.DBAPIError: (sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.ConnectionDoesNotExistError'>: connection was closed in the middle of operation. 

The above exception was the direct cause of the following exception: asyncpg.exceptions.ConnectionDoesNotExistError: connection was closed in the middle of operation.

Hope this helps. Thank you!

KoenDeWit

Thanks Mike for this additional information, I have added it to our Google Cloud Support case. It's interesting that you see the same errors with Python & SQLAlchemy and with Rails & ActiveRecord, so apparently the intermittent connection issues are not limited to Python & Django.

raulgonzalez

Hi everyone,

I think we are experiencing a similar issue with Cloud Run and Cloud SQL

Are you connecting through a VPC? If so, are you using the Serverless VPC Access Connector or Direct VPC?

greysteil

I think we may be experiencing something similar. Connecting to Cloud SQL (Postgres 15) from Cloud Run using Direct VPC. (We're using Go and Gorm)

KoenDeWit

@raulgonzalez We don't connect through a VPC; we use Unix sockets.

@raulgonzalez and @greysteil: do you see the same error messages as we see? (The error messages are listed above.) When did the intermittent connection errors start? Can you give a rough estimate of the percentage of SQL queries or connection attempts that are failing?

greysteil

Started a couple of weeks ago - not exactly sure when. We see

FATAL: pg_hba.conf rejects connection for host ... (SQLSTATE 28000)

which looks pretty similar to your password authentication issue (SQL state 28000 is "invalid authorization specification").

We're also seeing very slow requests for trivial database operations at that time, making me think our problem might be waiting for database connections.

KoenDeWit

The Google support team found that this is an issue on their side, and they will fix the problem shortly. If you have this issue and can't wait for Google's fix, you can switch to the Cloud Run second generation execution environment as a workaround.

mike-cloverleaf

Thanks Koen for dealing with google for almost 2 months! Good to hear they are going to fix their product.

ricardo_nacif

Hey guys, any updates here? We are using the Cloud Run second generation execution environment and are still facing a lot of disconnections between our cloud run service and our SQL db.

KoenDeWit

For us, the problem was solved with Cloud Run second generation execution environment.

I received a notification from Google about a new network architecture for Cloud SQL, but I don't know if this change also addresses the problems with the Cloud Run first generation execution environment.

Stef_R

Hello there!
Thanks a looot for raising this topic. Are there any recent updates?
I am facing the same issue 4 months later (Django on Cloud Run + Postgres 14 on Cloud SQL). The workaround of using Cloud Run second generation introduces even more connection losses on our side.

KoenDeWit

I don't know about the first generation, but we haven't seen any connection errors since switching to Cloud Run second generation.

mike-cloverleaf

We have also not seen any connection errors since moving our affected resources to Cloud Run second generation.

Stef_R

Got it, thank you @KoenDeWit @mike-cloverleaf

msoliz

Hello,

Facing the same issue on Cloud Run second generation and PostgreSQL 16. Is there any update on this case?

Stef_R

Hi there! On our side it seemed to be more of a consequence of unexpected timeouts than a problem by itself. We were using Gemini via Vertex AI, with multi-threading (ThreadPoolExecutor), then sending Gemini's answer to Postgres. The thread would get stuck, never really producing a timeout or closing properly, which led to a Cloud Run timeout, which in turn led to a timeout of the Cloud SQL connection.

We replaced the threads with ProcessPoolExecutor, added a timeout, and moved all the requests to PostgreSQL out of the pools. Now it seems to work just fine; we haven't had the issue in a while.
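
A rough sketch of that pattern, using only the standard library. The function names, the 30 second timeout, and the worker count are assumptions for illustration, and ask_model is a stand-in for the real external call, not actual Vertex AI code. The point is that workers only do the slow call, each result is awaited with a timeout, and the database write happens once in the parent process.

```python
import concurrent.futures as cf

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for the slow external call."""
    return prompt.upper()

def save_answers(answers: list) -> int:
    """Hypothetical stand-in for the database write; parent process only."""
    return sum(1 for a in answers if a is not None)

def run_batch(prompts, executor: cf.Executor, timeout_s: float = 30.0) -> list:
    """Submit every prompt, then wait at most timeout_s for each result."""
    futures = [executor.submit(ask_model, p) for p in prompts]
    answers = []
    for fut in futures:
        try:
            answers.append(fut.result(timeout=timeout_s))
        except cf.TimeoutError:
            # Stop waiting; this does not kill a call that is already running.
            fut.cancel()
            answers.append(None)
    return answers

def main() -> None:
    with cf.ProcessPoolExecutor(max_workers=2) as pool:
        answers = run_batch(["first prompt", "second prompt"], pool)
    # Database work happens here, outside the pool.
    print(save_answers(answers), "answers saved")

if __name__ == "__main__":
    main()
```

Note that leaving the with block still waits for running workers, so a call that can hang forever also needs its own timeout inside the worker.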

Intermittent connection errors from Cloud Run to Cloud SQL (PostgreSQL)