
Persistent 503 Error at 120s Despite 300s Timeout and Probe Settings - Need Urgent Help

Hi everyone,

I’m at my wits’ end with a persistent issue on my Cloud Run service, `deepfake-news-detector-api`, in the europe-west1 region. The service worked perfectly until a few days ago, but it now fails with a 503 error after exactly 120 seconds, even though its cold start needs 3-4 minutes (180-240s). I need assistance to get it back online.

**Details:**
- **What Worked**: Days ago, the service ran fine with 16 GiB memory, 4 CPUs, and a 300-second timeout, handling the cold start without issues.
- **Current Issue**: Since a few days ago, it consistently returns 503 after 120 seconds. I’ve tried:
  - Memory: 16 GiB.
  - CPU: 4.
  - Request Timeout: 300s (also tried 600s).
  - Startup Probe: Initial Delay 60s, Timeout 600s, Failure Threshold 5.
  - Minimum Instances: 0.
- **Logs**: Show a SIGABRT (an abort signal, not a segmentation fault) in `gunicorn/arbiter.py` during worker initialization, followed by “Startup probe failed” or instance shutdown.
- **Latest Logs (via CLI)**: [Paste the last 5 lines from `gcloud logging read` here if you have them—run the command below first.]

**Command to Reproduce:**
```shell
gcloud logging read "resource.type=cloud_run_revision resource.labels.service_name=deepfake-news-detector-api resource.labels.region=europe-west1" --limit=5 --freshness=1h --format="value(textPayload)"
```

**What I’ve Tried:**

- Reverted to the original working setup (16 GiB, 4 CPUs, 300s timeout).
- Adjusted startup probe to wait up to 50 minutes (5 x 600s).
- Cleared environment variables (e.g., GUNICORN_TIMEOUT).
- Deployed via both CLI and Console; same 120s 503.

Can anyone from the community or Google explain why the 120s limit persists despite my settings? Is this a Cloud Run bug? How do I fix the SIGABRT crash? I need my service back online urgently—please help!

Thanks,
[Maxim]


Hi @Maximkaa,

Welcome to Google Cloud Community!

From what I can observe, your Cloud Run service seems to hit a fixed 120-second limit during cold start, even though your request timeout is set to 300s or 600s. The SIGABRT raised in `gunicorn/arbiter.py` points to a problem during worker initialization, most likely excessive memory allocation or a threading issue while your workers start up.

Possible Causes & Fixes

  1. Google’s 120s Cold Start Limit
  2. Startup Probe Adjustments
    • Set initialDelaySeconds=20, failureThreshold=10, and periodSeconds=10.
    • Log debug info in gunicorn.conf.py.
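For the debug logging in step 2, a minimal `gunicorn.conf.py` sketch might look like the following. The hook names (`on_starting`, `post_fork`, `worker_abort`) come from gunicorn’s server-hook API; the worker count and timeout values are assumptions you should adjust for your service:

```python
# gunicorn.conf.py — minimal debug-logging sketch (worker/timeout values are assumptions)
import os

loglevel = "debug"                                       # verbose arbiter/worker logs
workers = int(os.environ.get("GUNICORN_WORKERS", "1"))   # one worker to limit memory
timeout = 0                                              # 0 disables gunicorn's own worker timeout

def on_starting(server):
    # Runs in the arbiter just before workers are forked.
    server.log.info("arbiter starting, pid=%s", os.getpid())

def post_fork(server, worker):
    # Runs in each worker right after fork — a good place to time model loading.
    worker.log.info("worker forked, pid=%s", worker.pid)

def worker_abort(worker):
    # Called when a worker receives SIGABRT — matching the crash in your logs.
    worker.log.warning("worker aborted (SIGABRT), pid=%s", worker.pid)
```

With `loglevel = "debug"`, the arbiter logs each phase of worker startup, which should show exactly how far initialization gets before the abort.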

  3. Reduce Memory Usage
    • Limit GUNICORN_WORKERS to 1 and track memory usage (gcloud logging read with severity>=ERROR).
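To see how close worker initialization gets to the 16 GiB limit, you could log peak resident memory right after your model loads. A stdlib-only sketch (the helper name is mine; Linux reports `ru_maxrss` in KiB):

```python
# Report peak resident memory — on Linux, ru_maxrss is in KiB.
import resource

def peak_rss_gib() -> float:
    """Return this process's peak resident set size in GiB (Linux units assumed)."""
    kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kib / (1024 * 1024)  # KiB -> GiB
```

If the value approaches the configured limit during startup, the crash may be an allocation failure inside a native library surfacing as SIGABRT rather than a Cloud Run bug.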

  4. Workaround: Keep Instances Warm
    • Set a lightweight warm-up service to ping Cloud Run every few minutes.
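For step 4, the warm-up ping can be as simple as a Cloud Scheduler job hitting the service every few minutes. A stdlib sketch of the ping itself (the URL and `/healthz` path below are placeholders, not your real endpoint):

```python
# Minimal warm-up ping sketch — the URL in the comment below is a placeholder.
import urllib.request

def ping_once(url: str, timeout: float = 10.0) -> int:
    """Issue one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status

# Example (placeholder URL), scheduled every few minutes:
# ping_once("https://deepfake-news-detector-api-xxxxx-ew.a.run.app/healthz")
```

Note that with minimum instances at 0, a ping only keeps an instance warm between requests; raising minimum instances to 1 is the more reliable way to avoid cold starts entirely.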

  5. Test in a GCE VM or "CPU Always Allocated" Mode
    • Running the container on a GCE VM isolates whether the crash is specific to Cloud Run; "CPU always allocated" combined with a minimum instance count of 1 avoids cold starts altogether.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.