Network egress idle timeout issues with Cloud Run

piotrekkr · 02-21-2025 01:28 AM

I'm running a PHP app image in Cloud Run, and for a few days now I have had issues with egress traffic from Cloud Run services in multiple GCP projects. Errors like these:

cURL error 28: Resolving timed out after 2103 milliseconds (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for https://xxxx.my.salesforce.com

Idle timeout reached for https://maps.googleapis.com

Idle timeout reached for "https://api.clerk.com/".

cURL error 28: Resolving timed out after 2299 milliseconds (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for https://xxxx.sandbox.my.salesforce.com

Send failure: Broken pipe for "https://example.com".

cURL error 35: Send failure: Connection reset by peer (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for https://test.salesforce.com/services/oauth2/token

It seems like the service can connect to the host, but then data is not sent between CR and the destination. It correlates with increased concurrent connections. With low traffic (5-10 concurrent requests) it works just fine. When traffic increases (30+ concurrent requests) it starts to happen, not on every request, but on most of the endpoints that call those external services.

It first started happening on our testing environments and stayed like this for a week. The production environments had the same app image versions and were working without issues. However, yesterday even the production environments started to show the same issues.

I did not make any significant changes to code or infrastructure in the last few weeks, so this is quite unexpected. I've checked the status pages for all those external services and there were no issues on their side.

I've checked the CR service metrics and there is nothing alarming there. Even with traffic spikes it creates only 3-7 service instances, with memory usage around 50% and CPU usage around 40%.

The Cloud Run service is using a Direct VPC connection, but it is configured only for internal hosts, so external traffic should go through the global CR network.

I've tried things like:
- decreasing the service max concurrency setting to below 20 so it creates more instances
- downgrading PHP versions in the container
- redeploying new revisions manually when the issues occur
Nothing seems to help.

Has anyone observed similar issues? How can I debug this? Thanks

// EDIT 2025-03-04

The issue seemed to have fixed itself for a few days, but it started happening again today. SSL timeouts, DNS resolution issues and idle timeouts again.
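
One way to start debugging is to bucket the logged errors by the phase that failed, so the counts can be compared against concurrency over time. A minimal Python sketch, assuming the messages look like the ones quoted above. The bucket names and the `classify` and `tally` helpers are made up for illustration, and since the app itself is PHP this would be run against exported logs rather than inside the service:

```python
import re
from collections import Counter

# Hypothetical buckets; the patterns are based only on the error
# strings quoted in this thread, not on a complete list of cURL errors.
PATTERNS = [
    ("dns", re.compile(r"Resolving timed out", re.I)),
    ("idle_timeout", re.compile(r"Idle timeout reached", re.I)),
    ("reset", re.compile(r"Broken pipe|Connection reset by peer", re.I)),
]


def classify(message: str) -> str:
    """Return the first bucket whose pattern matches, else 'other'."""
    for name, pattern in PATTERNS:
        if pattern.search(message):
            return name
    return "other"


def tally(messages):
    """Count how many messages fall into each bucket."""
    return Counter(classify(m) for m in messages)
```

If most failures land in the `dns` bucket during spikes, that would point at name resolution rather than at the remote APIs, but that is only a way to narrow things down, not a diagnosis.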

mcbsalceda

Hi @piotrekkr,

Welcome to Google Cloud Community!

It sounds like your Cloud Run service might be running into connection timeout issues, especially as concurrency increases. Cloud Run has idle timeouts for outbound connections: 10 minutes for VPC requests and 20 minutes for internet traffic. If your app keeps connections idle beyond these limits, the gateway will close them.

Also, outbound connections can occasionally reset due to infrastructure updates. If your app reuses long-lived connections, it’s a good idea to ensure it can detect and re-establish them automatically to avoid using dead connections.
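
As a rough illustration of detecting a dead connection and re-establishing it, here is a minimal Python sketch (your app is PHP, so treat it as pseudocode for the pattern rather than something to deploy). The `call_with_retry` helper and its parameters are hypothetical, and it assumes the `request` callable opens a fresh connection each time it is invoked:

```python
import time

# Errors that usually mean the connection is dead and a new one is needed.
RETRYABLE = (ConnectionResetError, BrokenPipeError, TimeoutError)


def call_with_retry(request, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Call request(); on a retryable error, back off and call it again.

    The delay doubles after each failed attempt. The last failure is
    re-raised so the caller still sees the original error.
    """
    for attempt in range(attempts):
        try:
            return request()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Retrying like this is only safe for requests that can be repeated without side effects, and it hides the underlying problem rather than explaining it.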

If the issue persists despite these changes, it might be worth reaching out to Google Cloud Support for further investigation.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

piotrekkr

I can imagine that this could happen if I kept connections open for minutes, but I'm talking about 10-15 second timeouts, which should be well above what is needed by Google Maps or the other APIs we are calling. What is even more troubling is that we tend to get connection issues such as DNS resolution timeouts, SSL issues and sometimes connection resets when concurrent connections to our services increase. I know that API responses are sometimes slow, so there may be a timeout if we expect responses below some threshold, but not being able to even connect to multiple external APIs is really weird. I've even decreased concurrency to 15, but this did not help much. It can happen right after a deploy or some time after. With few concurrent requests we don't have such issues. Memory and CPU usage is low even with traffic spikes, so I have no idea what's going on.
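
To see which phase is actually slow under load, each step of an outbound HTTPS connection can be timed separately. A minimal Python sketch using only the standard library; the `timed` and `probe` helpers are hypothetical names, and it only looks at the first address returned by DNS, so it is a rough diagnostic rather than a faithful copy of what cURL does:

```python
import socket
import ssl
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (seconds_elapsed, result_or_exception)."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    except OSError as exc:  # DNS failures, timeouts, resets, TLS errors
        result = exc
    return time.perf_counter() - start, result


def probe(host, port=443, timeout=5.0):
    """Time DNS, TCP connect and TLS handshake; return (timings, error)."""
    report = {}
    report["dns"], addrs = timed(
        socket.getaddrinfo, host, port, type=socket.SOCK_STREAM
    )
    if isinstance(addrs, Exception):
        return report, addrs
    ip = addrs[0][4][0]  # first resolved address only
    report["tcp"], sock = timed(socket.create_connection, (ip, port), timeout)
    if isinstance(sock, Exception):
        return report, sock
    context = ssl.create_default_context()
    report["tls"], tls = timed(context.wrap_socket, sock, server_hostname=host)
    if isinstance(tls, Exception):
        sock.close()
        return report, tls
    tls.close()
    return report, None
```

Running `probe("maps.googleapis.com")` from many threads at once inside the container, at roughly the concurrency where the errors appear, would show whether the time goes into `dns`, `tcp` or `tls`.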

hungig

We are having the same issue right now. We didn't change anything in the infra.

piotrekkr

I've set CR concurrency to 15 per instance and it seems to do better, but it could also be something that was fixed on the GCP side. I'll increase concurrency again to test when I have time, probably next week.