
Cloud Run VPC connection to VM timeouts

Hi everyone,

I have a Python App (Superset) running on Cloud Run, using a Postgres DB running on a GCE VM.

The connection works fine on a small scale, but as soon as more users start using the application, we start experiencing connection timeouts between the application (Cloud Run) and the database (VM).

Here's a detailed description of my setup and the problem:

Setup:

  • Database: PostgreSQL running on a Google Cloud e2-medium VM, using a Container Optimized OS.
  • Application: Deployed on Google Cloud Run. The backend is Python using SQLAlchemy for database connections.
  • Connection: Cloud Run connects to the PostgreSQL VM via a VPC, using an internal IP and a network tag to allow traffic.
  • Application Server: Superset is served by Gunicorn with 4 workers and 8 threads each.

Problem:

Database connections work normally with a low number of Cloud Run containers. However, as I scale up the number of Cloud Run instances (increasing the number of simultaneous connection attempts), I start experiencing connection timeouts. This occurs even though:

  • PostgreSQL's max_connections is not reached.
  • CPU usage on the VM is below 50%.
  • Memory usage on the VM is below 30%.
  • There are no error logs on the Postgres side.

SQLAlchemy Configuration:

My SQLAlchemy engine options are configured as follows:

 
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_pre_ping": True,
    "pool_size": 32,
    "max_overflow": 16,
    "pool_timeout": 300,
    "connect_args": {
        "connect_timeout": 300,
    },
}

Note that I set high timeout values and still experience timeouts.
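One thing worth noting about the configuration above: SQLAlchemy pools are per-process, so each Gunicorn worker gets its own pool, and the numbers multiply quickly. A rough worst-case sketch (assuming the default of one pool per worker, and using the settings from my setup):

# Worst-case Postgres connections, assuming one SQLAlchemy pool per
# Gunicorn worker process (pools are per-process, not shared).
workers=4          # Gunicorn workers in the Superset setup
pool_size=32       # "pool_size" from SQLALCHEMY_ENGINE_OPTIONS
max_overflow=16    # "max_overflow" from SQLALCHEMY_ENGINE_OPTIONS

per_instance=$(( workers * (pool_size + max_overflow) ))
echo "per Cloud Run instance: $per_instance connections"   # 192

# Scaling out multiplies again:
for instances in 1 5 10; do
  echo "$instances instances -> $(( instances * per_instance )) connections"
done

So even if Postgres's max_connections is never hit, something between Cloud Run and the VM may be throttling or dropping that many simultaneous connection attempts.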

I've already checked the VPC firewall rules to ensure that inbound traffic on port 5432 is allowed from the Cloud Run service's IP range to my VM; the rule uses network tags to allow traffic.

What I've Tried:

  • Monitored CPU and memory usage on the VM and experimented with larger VMs
  • Verified PostgreSQL's max_connections setting.
  • Checked firewall rules.

I'm looking for suggestions on how to further diagnose this problem and potential solutions. Any help or advice would be greatly appreciated. Please let me know if any other information would be helpful.

Thanks in advance!

1 ACCEPTED SOLUTION

Hi @knet, thanks for your help! I did check the database settings, logs, etc., and they didn't seem to be the source of the problem.

However, I resolved the problem by changing the firewall rule that allows Cloud Run to reach the VM via Direct VPC egress: instead of using a network tag as the traffic source, I used the subnetwork's CIDR IP range, which did the trick. I found that solution based on these posts:
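For anyone hitting the same issue, the change can be sketched with gcloud roughly as follows; the network, subnet, rule, region, CIDR, and tag names here are placeholders, not my actual values:

# Look up the CIDR range of the subnet used for Direct VPC egress
# (subnet and region names are placeholders).
gcloud compute networks subnets describe run-subnet \
    --region=us-central1 --format='value(ipCidrRange)'

# Allow Postgres traffic from that range instead of from a source network tag.
gcloud compute firewall-rules create allow-run-to-postgres \
    --network=my-vpc \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:5432 \
    --source-ranges=10.8.0.0/26 \
    --target-tags=postgres-vm   # tag on the destination VM, not the source

The key point is that the source of the rule is the subnet's CIDR range rather than a network tag.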

https://www.googlecloudcommunity.com/gc/Infrastructure-Compute-Storage/DIRECT-VPC-for-cloud-run/m-p/...

https://stackoverflow.com/questions/79086615/cloud-run-direct-vpc-egress-connection-timeout-issue


3 REPLIES

I've heard that many databases don't work well when a large number of clients connect to them. I would suggest looking at your database's documentation to see if there's anything they say on this topic. 

If the issue is too many Cloud Run instances, you might be able to run fewer, larger Cloud Run instances (more CPU/memory).

If the issue is too many IP addresses, you could try using VPC Connectors instead of Direct VPC; this would reduce the number of IPs connecting to the database.
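If you want to test that route, a connector can be created and attached roughly like this; the connector, service, network, region, and IP range names are placeholders to illustrate the flags:

# Create a Serverless VPC Access connector (the /28 range must be unused).
gcloud compute networks vpc-access connectors create run-connector \
    --region=us-central1 \
    --network=my-vpc \
    --range=10.9.0.0/28

# Route the Cloud Run service's egress through the connector.
gcloud run services update superset \
    --region=us-central1 \
    --vpc-connector=run-connector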

If the issue is the number of connections/concurrent requests, you could try reducing the concurrency of your service, and running a larger number of smaller instances, each of which only processes a small number of requests.
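Lowering per-instance concurrency while bounding instance count can be sketched as follows; the service name and numbers are just example values to show the relevant flags:

# Fewer concurrent requests per instance, with a cap on how many
# instances Cloud Run can scale out to.
gcloud run services update superset \
    --region=us-central1 \
    --concurrency=20 \
    --max-instances=10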

Sorry I don't have more concrete advice.


Thank you very much for sharing! I had the same struggle accessing an NFS server on a VM from Cloud Run (posted here); adding a firewall rule with the subnet range instead of the network tag seems to have solved it. I implemented the change yesterday, and in 24 hours I've had no more mount timeout issues!