Composer workers are restarting frequently

ankit_rawat28 · 03-19-2025 08:58 AM

@pgstorm148 I am using Composer 2 (composer-2.8.5-airflow-2.7.3), running with a minimum of 2 schedulers, a minimum of 2 workers, and so on.

On the Composer GKE workloads, I have observed that both of my worker containers restart continuously after some time, even when no DAG or task is running. The events show that the liveness probe failed.

I cannot figure out why this is happening or how to resolve it, because we cannot make any changes on GKE Autopilot. I believe the failed liveness probe is what causes the containers to restart, and I believe the check is performed by the airflow_monitoring DAG that Composer creates itself [correct me if I am wrong].

Can anyone help me understand and resolve this issue? Screenshot attached: Screenshot 2025-03-19 at 9.15.13 PM.png

pgstorm148

Well, I see this has been going on for a long time, buddy: 449 restarts is quite a number.
For the time being, investigate the worker logs from just before the failures.
Try this command:

gcloud logging read 'resource.type="cloud_composer_environment" AND resource.labels.environment_name="YOUR_ENVIRONMENT_NAME" AND resource.labels.location="YOUR_LOCATION" AND logName:"worker" AND timestamp<="2025-03-19T19:09:14Z" AND timestamp>="2025-03-19T19:08:14Z"'

This will show you logs from the minute before the most recent failure.
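
If you need to repeat this for other restarts, the one-minute window can be derived from the failure timestamp instead of being typed by hand. The following is only a sketch: the function names window_start and worker_log_filter are made up for illustration, and it assumes GNU date (as found on Linux or Cloud Shell).

```shell
# Print the UTC timestamp 60 seconds before the given RFC 3339 timestamp.
# Assumes GNU date; BSD/macOS date uses different flags.
window_start() {
  end_epoch=$(date -u -d "$1" +%s)
  date -u -d "@$(( end_epoch - 60 ))" +%Y-%m-%dT%H:%M:%SZ
}

# Build the Cloud Logging filter for worker logs in that one-minute window.
# Arguments: environment name, location, failure timestamp (UTC).
worker_log_filter() {
  env_name=$1
  location=$2
  end_ts=$3
  start_ts=$(window_start "$end_ts")
  printf '%s' "resource.type=\"cloud_composer_environment\" AND resource.labels.environment_name=\"$env_name\" AND resource.labels.location=\"$location\" AND logName:\"worker\" AND timestamp>=\"$start_ts\" AND timestamp<=\"$end_ts\""
}

# Usage (same placeholders as in the command above):
# gcloud logging read "$(worker_log_filter YOUR_ENVIRONMENT_NAME YOUR_LOCATION 2025-03-19T19:09:14Z)"
```

Pass the time of the restart you are investigating as the third argument; the filter then covers the 60 seconds leading up to it.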

Since it is a liveness probe failure, check for memory pressure in Cloud Monitoring.
Even though no DAGs appear to be running, there might be scheduled DAGs that are resource-intensive during parsing or scheduling.

If nothing works or you're still in doubt,
try increasing the worker resources:

gcloud composer environments update YOUR_ENVIRONMENT_NAME \
  --location YOUR_LOCATION \
  --worker-cpu 4 \
  --worker-memory 16

Adjust Airflow configurations: You can modify these through the Composer UI or gcloud:

worker_timeout: Increase this value
worker_concurrency: Lower this value to reduce load on workers
dag_concurrency (renamed max_active_tasks_per_dag since Airflow 2.2): Reduce this if individual DAGs run many tasks at once
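
As a rough example of the gcloud route, Airflow options are overridden with --update-airflow-configs using section-option=value pairs. The values below (8) are placeholders picked for illustration, not recommendations, and the sketch assumes Airflow 2.7.3, where the dag_concurrency option goes by the name max_active_tasks_per_dag in the [core] section.

```shell
# Override Airflow config options on an existing environment.
# Keys use the form <section>-<option>; the values here are illustrative only.
gcloud composer environments update YOUR_ENVIRONMENT_NAME \
  --location YOUR_LOCATION \
  --update-airflow-configs=celery-worker_concurrency=8,core-max_active_tasks_per_dag=8
```

Check your current values first and change them in small steps, since an update like this restarts Airflow components.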

And depending on how frequent the restarts are,
consider moving to a larger environment size.
If nothing works, raise a Google support case.

Let me know if any of these solve your issue.

ankit_rawat28

OK @pgstorm148, thanks for the response. Let me try this. Just to add: I have checked the logs and monitoring metrics, and all scheduler and worker resources are underutilized.

ankit_rawat28

Currently, the environment seems to be working fine. We are still seeing worker restarts, but the rate is very low: 2 restarts in 24 hours.
I also observed that my triggerer logs contained warnings like the one below.

Triggerer's async thread was blocked for 0.65 seconds, likely due to the highly utilized environment.

Following this doc: https://cloud.google.com/composer/docs/composer-2/troubleshooting-triggerer
I increased my triggerer count from a minimum of 1 to a minimum of 2 and increased the CPU from 0.5 to 1 CPU. I also noticed that my deferred tasks were spiking suddenly from 0 to 70 within 2-3 minutes, which might be delaying task execution by the workers.

Another thing I learned is that GKE reserves part of each node's memory for itself (25% of the first 4 GiB), so I have also increased the memory.
https://cloud.google.com/composer/docs/composer-2/debug-out-of-memory-and-out-of-storage-dag-issues

GKE reserves 25% of the first 4 GiB of memory. GKE also reserves an additional eviction-threshold: 100 MiB of memory on each node for kubelet eviction.
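
To get a feel for how much memory that takes away, here is a small sketch that estimates the reservation for a node of a given size. The quote above only covers the first 4 GiB; the higher tiers in the code (20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, 2% above 128 GiB, and a flat 255 MiB for nodes under 1 GiB) are taken from the general GKE node documentation as I understand it, so treat them as an assumption and verify against the current docs. The function name gke_reserved_mib is made up for this example.

```shell
# Estimate GKE-reserved memory in MiB for a node with the given total MiB.
# Tier percentages beyond the first 4 GiB are assumptions (see note above).
# Each tier is rounded down to a whole MiB.
gke_reserved_mib() {
  total=$1
  eviction=100   # kubelet eviction threshold, MiB per node
  if [ "$total" -lt 1024 ]; then
    echo $(( 255 + eviction ))
    return
  fi
  reserved=0
  remaining=$total
  # Each entry is <tier size in MiB>:<percent reserved in that tier>
  for tier in 4096:25 4096:20 8192:10 114688:6; do
    size=${tier%%:*}
    pct=${tier##*:}
    if [ "$remaining" -lt "$size" ]; then chunk=$remaining; else chunk=$size; fi
    reserved=$(( reserved + chunk * pct / 100 ))
    remaining=$(( remaining - chunk ))
  done
  # Anything above 128 GiB is reserved at 2%
  reserved=$(( reserved + remaining * 2 / 100 ))
  echo $(( reserved + eviction ))
}

gke_reserved_mib 4096   # 1024 MiB (25% of 4 GiB) + 100 MiB eviction = 1124
```

So on a node with 4 GiB of memory, roughly 1.1 GiB is not available to pods, which is worth keeping in mind when sizing workers.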

Points to look for:
1. Monitor the health of every Composer component while the environment is unhealthy or workers are restarting.
2. Reduce the number of deferred tasks that run at the same time.
3. Increase worker CPU and memory to account for the GKE 25% reservation.

These points can help you start debugging and resolving the issue.