Gunicorn workers on Google App Engine randomly being sent SIGKILL after WORKER TIMEOUT. What is going on?

This happens randomly, sometimes when a new instance is spun up (example below), sometimes when an instance has already been running for some time.

This error cycle of "boot worker - WORKER TIMEOUT - SIGKILL" can last anywhere from 10 seconds to more than an hour.

This can also happen on any endpoint in my application; it has hit both POST and GET requests on different endpoints.

Initially I thought it was due to a bug in my code or malformed data in the POST request, but after investigating I realized that sending the exact same POST/GET request while the instance is not stuck in this error loop works perfectly fine.

Below are the logs from an example where a POST request spun up a new instance that was then stuck in this loop for over an hour, before it suddenly resolved itself and handled the POST request normally with status 200.

DEFAULT 2024-04-07T07:56:06.463491Z [2024-04-07 07:56:06 +0000] [11] [INFO] Starting gunicorn 21.2.0
DEFAULT 2024-04-07T07:56:06.464623Z [2024-04-07 07:56:06 +0000] [11] [DEBUG] Arbiter booted
DEFAULT 2024-04-07T07:56:06.464637Z [2024-04-07 07:56:06 +0000] [11] [INFO] Listening at: http://0.0.0.0:8081 (11)
DEFAULT 2024-04-07T07:56:06.464712Z [2024-04-07 07:56:06 +0000] [11] [INFO] Using worker: gthread
INFO 2024-04-07T07:56:06.466397Z [pid1-nginx] Successfully connected to 127.0.0.1:8081 after 3.361200658s [session:RB4VHXB]
INFO 2024-04-07T07:56:06.466735Z [pid1-nginx] Successfully connected to localhost:8081 after 3.361617817s [session:RB4VHXB]
INFO 2024-04-07T07:56:06.469263Z [pid1-nginx] Creating config at /tmp/nginxconf-644813283/nginx.conf [session:RB4VHXB]
INFO 2024-04-07T07:56:06.472325Z [pid1-nginx] Starting nginx (pid 17): /usr/sbin/nginx -c /tmp/nginxconf-644813283/nginx.conf [session:RB4VHXB]
DEFAULT 2024-04-07T07:56:06.544837Z [2024-04-07 07:56:06 +0000] [16] [INFO] Booting worker with pid: 16
DEFAULT 2024-04-07T07:56:06.576102Z [2024-04-07 07:56:06 +0000] [16] [DEBUG] Closing connection.
DEFAULT 2024-04-07T07:56:06.577868Z [2024-04-07 07:56:06 +0000] [16] [DEBUG] Closing connection.
DEFAULT 2024-04-07T07:56:06.579116Z [2024-04-07 07:56:06 +0000] [16] [DEBUG] GET /_ah/start
DEFAULT 2024-04-07T07:56:06.618933Z [2024-04-07 07:56:06 +0000] [19] [INFO] Booting worker with pid: 19
DEFAULT 2024-04-07T07:56:06.666217Z [2024-04-07 07:56:06 +0000] [11] [DEBUG] 2 workers
DEFAULT 2024-04-07T07:56:06.739491Z [2024-04-07 07:56:06 +0000] [16] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T07:57:07.717148Z [2024-04-07 07:57:07 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:16)
DEFAULT 2024-04-07T07:57:08.720797Z [2024-04-07 07:57:08 +0000] [11] [ERROR] Worker (pid:16) was sent SIGKILL! Perhaps out of memory?
DEFAULT 2024-04-07T07:57:08.720910Z 2024/04/07 07:57:08 [error] 18#18: *6 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T07:57:08.824712Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.15 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T07:57:08.898539Z [2024-04-07 07:57:08 +0000] [19] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T07:57:08.990455Z [2024-04-07 07:57:08 +0000] [27] [INFO] Booting worker with pid: 27
DEFAULT 2024-04-07T07:58:08.968963Z [2024-04-07 07:58:08 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:19)
DEFAULT 2024-04-07T07:58:09.973588Z 2024/04/07 07:58:09 [error] 18#18: *7 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T07:58:09.973611Z [2024-04-07 07:58:09 +0000] [11] [ERROR] Worker (pid:19) was sent SIGKILL! Perhaps out of memory?
DEFAULT 2024-04-07T07:58:10.106688Z [2024-04-07 07:58:10 +0000] [33] [INFO] Booting worker with pid: 33
DEFAULT 2024-04-07T07:58:10.177760Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.976 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T07:58:10.196059Z [2024-04-07 07:58:10 +0000] [33] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T07:59:11.149239Z [2024-04-07 07:59:11 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:33)
DEFAULT 2024-04-07T07:59:12.153215Z 2024/04/07 07:59:12 [error] 18#18: *9 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T07:59:12.153281Z [2024-04-07 07:59:12 +0000] [11] [ERROR] Worker (pid:33) was sent SIGKILL! Perhaps out of memory?
DEFAULT 2024-04-07T07:59:12.307443Z [2024-04-07 07:59:12 +0000] [39] [INFO] Booting worker with pid: 39
DEFAULT 2024-04-07T08:00:45.632391Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.725 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
...
Repeat of the same for the next hour: POST request received, worker boots, worker times out, then worker is sent SIGKILL.
...

DEFAULT 2024-04-07T09:01:47.742589Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.369 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T09:01:47.916781Z [2024-04-07 09:01:47 +0000] [387] [INFO] Booting worker with pid: 387
DEFAULT 2024-04-07T09:01:48.003333Z [2024-04-07 09:01:48 +0000] [387] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T09:02:13.317927Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 86.175 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T09:02:36.933886Z [2024-04-07 09:02:36 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:381)
DEFAULT 2024-04-07T09:02:37.938484Z [2024-04-07 09:02:37 +0000] [11] [ERROR] Worker (pid:381) was sent SIGKILL! Perhaps out of memory?
DEFAULT 2024-04-07T09:02:37.938619Z 2024/04/07 09:02:37 [error] 18#18: *140 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T09:02:38.097720Z [2024-04-07 09:02:38 +0000] [393] [INFO] Booting worker with pid: 393
DEFAULT 2024-04-07T09:02:38.142051Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.351 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T09:02:38.189106Z [2024-04-07 09:02:38 +0000] [393] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T09:02:38.196806Z [2024-04-07 09:02:38 +0000] [393] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T09:02:48.105780Z [2024-04-07 09:02:48 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:387)
DEFAULT 2024-04-07T09:02:49.112205Z 2024/04/07 09:02:49 [error] 18#18: *142 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T09:02:49.112209Z [2024-04-07 09:02:49 +0000] [11] [ERROR] Worker (pid:387) was sent SIGKILL! Perhaps out of memory?

Finally it processes the POST request correctly with status 200.

DEFAULT 2024-04-07T09:02:49.114051Z [protoPayload.method: POST] [protoPayload.status: 200] [protoPayload.responseSize: 135 B] [protoPayload.latency: 81.691 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T09:02:49.367448Z [2024-04-07 09:02:49 +0000] [401] [INFO] Booting worker with pid: 401
DEFAULT 2024-04-07T09:02:49.464783Z [2024-04-07 09:02:49 +0000] [401] [DEBUG] POST /processNewOrder

I also checked memory usage, but there does not seem to be an issue there. I am using the B2 instance class, which should have 768 MB of RAM, and during the incident RAM usage was low at roughly ~210 MB.
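For what it's worth, a standard-library way to cross-check per-worker memory from inside the app would be something along the lines of the sketch below (the helper name and log tags are illustrative, not part of my actual code):

```python
import logging
import resource


def log_worker_memory(tag: str) -> None:
    # Peak resident set size of the current worker process;
    # on Linux, ru_maxrss is reported in kilobytes.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logging.info("%s: worker peak RSS %.1f MB", tag, peak_kb / 1024.0)


# e.g. call log_worker_memory("start /processNewOrder") at the top of the
# handler and log_worker_memory("end /processNewOrder") just before returning.
```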


  • This issue never occurs when I am running the Gunicorn server locally through App Engine's local development setup.

  • I have tried increasing the instance class to B4 (1536 MB of RAM), thinking that maybe memory usage was not being reported correctly, but the same problem occurs.

  • I have tried increasing the timeout parameter in my Gunicorn config file to 20 minutes, but the same problem still occurs; each worker now simply takes 20 minutes to time out.

  • I have also tried setting preload_app=True in my Gunicorn config file, thinking the issue might be that a worker is spun up before the application code is ready, but the same problem still occurs (the relevant settings are sketched below).
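
For reference, the Gunicorn settings mentioned above live in a gunicorn.conf.py along these lines. This is only a sketch (worker/thread counts are illustrative, not my exact production file), and the worker_abort hook is just a debugging idea for dumping a traceback when the arbiter aborts a hung worker (Gunicorn sends SIGABRT before the final SIGKILL), not something that fixes the problem:

```python
# gunicorn.conf.py -- illustrative sketch, not the exact production file
import sys
import faulthandler

# Worker model matching the logs above.
worker_class = "gthread"
workers = 2              # illustrative
threads = 4              # illustrative

# Attempted mitigation: raise the worker timeout to 20 minutes.
timeout = 1200

# Attempted mitigation: import the application in the master before forking.
preload_app = True


def worker_abort(worker):
    # Called when a worker receives SIGABRT, which the arbiter sends on
    # WORKER TIMEOUT before escalating to SIGKILL; dump where it was stuck.
    worker.log.critical("worker %s aborted, dumping traceback", worker.pid)
    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```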

Given how randomly it occurs, I am not able to reliably reproduce the issue, and hence cannot pin down a solution. One possibility is some kind of outage on Google Cloud's end, but before going there, I wanted to see if anyone else has faced this issue and can provide any advice on a potential solution. Thanks in advance for any comments!

2 REPLIES

If your service is in us-central1, it is likely related to this incident, which took Google three days to respond to and which was never reported as an incident.

Is your issue resolved? 

Hi @Naianasha, thank you for your reply. My service is in asia-southeast2, but if Google servers in my region were also having issues, that would definitely explain the random SIGKILLs. However, this happens quite often (almost every day in my region), and if it really is caused by an internal GAE issue, then I would need to consider moving to another service/platform.