
GKEStartPodOperator in deferrable mode (deferrable=True) is not able to clean up pods

I am trying GKEStartPodOperator to launch pods on an existing GKE cluster with deferrable=True and do_xcom_push=True. The operator is able to launch the pod and the container gets executed, but:

1. the base container completes

2. the sidecar container continues to run

and because of #2 (or so I assume), the task is not able to complete on its own as expected in deferrable mode.

Args Used:

self.namespace=dag_config.get("gke_name_space")
self.cluster_name=dag_config.get("gke_cluster_name")
self.location=dag_config.get("region_name")
self.project_id=dag_config.get("project_id")
self.name=pod_name
self.image=spec["containers"][0]["image"]
self.full_pod_spec=deserialized_pod_spec
self.in_cluster=False
self.get_logs=True
self.do_xcom_push=True
self.deferrable=True
# self.is_delete_operator_pod=True
self.on_finish_action="delete_pod"
self.random_name_suffix=False
self.poll_interval=2
self.termination_grace_period=0
 
I'm not able to figure out why this is happening. Any pointers or help on this? Are there any settings that are required when deferrable=True? Do we need to define any triggers or sensors? The triggerer is running in my Cloud Composer environment.
 
Log Snippet

[2023-11-15, 02:55:48 UTC] {credentials_provider.py:353} INFO - Getting connection using `google.auth.default()` since no explicit credentials are provided.
[2023-11-15, 02:56:18 UTC] {taskinstance.py:1392} INFO - Pausing task as DEFERRED. dag_id=dpdf-pt-dev-dag-CloudflowGKEPodOperator, task_id=launch_gke_pods, execution_date=20231115T025458, start_date=20231115T025546
[2023-11-15, 02:56:18 UTC] {local_task_job.py:212} INFO - Task exited with return code 0
[2023-11-15, 02:56:18 UTC] {taskinstance.py:2599} INFO - 0 downstream tasks scheduled from follow-on schedule check

--------------------------------------------------------------------------------
[2023-11-15, 02:56:33 UTC] {taskinstance.py:1290} INFO - Starting attempt 1 of 1
[2023-11-15, 02:56:33 UTC] {taskinstance.py:1291} INFO -
--------------------------------------------------------------------------------
[2023-11-15, 02:56:33 UTC] {taskinstance.py:1310} INFO - Executing <Task(XXXXXXXXXGKEStartPodOperator): launch_gke_pods> on 2023-11-15 02:54:58+00:00
[2023-11-15, 02:56:33 UTC] {standard_task_runner.py:55} INFO - Started process 64561 to run task
[2023-11-15, 02:56:33 UTC] {standard_task_runner.py:82} INFO - Running: ['airflow', 'tasks', 'run', 'dpdf-pt-dev-dag-XXXXXXXXXGKEPodOperator', 'launch_gke_pods', 'manual__2023-11-15T02:54:58+00:00', '--job-id', '31440', '--raw', '--subdir', 'DAGS_FOLDER/dev/rst/qqqq-pp-dev-dag-XXXXXXXXXGKEPodOperator.py', '--cfg-path', '/tmp/tmps5mev5lk']
[2023-11-15, 02:56:33 UTC] {standard_task_runner.py:83} INFO - Job 31440: Subtask launch_gke_pods
[2023-11-15, 02:56:33 UTC] {task_command.py:393} INFO - Running <TaskInstance: dpdf-pt-dev-dag-XXXXXXXXXGKEPodOperator.launch_gke_pods manual__2023-11-15T02:54:58+00:00 [running]> on host airflow-worker-whsmn
[2023-11-15, 02:56:34 UTC] {base.py:73} INFO - Using connection ID 'google_cloud_default' for task execution.
[2023-11-15, 02:56:34 UTC] {credentials_provider.py:353} INFO - Getting connection using `google.auth.default()` since no explicit credentials are provided.
[2023-11-15, 02:56:34 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:36 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:38 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:40 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:42 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:44 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:46 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:48 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:50 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:52 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:54 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:56 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:56:58 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running
[2023-11-15, 02:57:00 UTC] {pod_manager.py:516} INFO - Pod abc-xyzuvt-pod-p6yczain6wi8 has phase Running

 

5 REPLIES

The issue with the task not completing as expected with Airflow's GKEStartPodOperator seems to be related to the behavior of the sidecar container. When the on_finish_action parameter is set to "delete_pod", Airflow is instructed to delete the pod once the main container completes. However, if the sidecar container continues to run, it suggests that it is not configured to terminate alongside the main container.

To address this issue, consider the following steps:

  1. Review Sidecar Container Configuration: Ensure that the sidecar container is set up to terminate when the main container's task is done. This might involve implementing proper signaling mechanisms or health checks within the sidecar container.

  2. Adjust on_finish_action for Debugging: Temporarily set on_finish_action to None. This will prevent Airflow from deleting the pod immediately after the main task completion, allowing you to observe and debug the behavior of the sidecar container.

  3. Set termination_grace_period: Use a non-zero termination_grace_period to give the containers in the pod time to shut down gracefully. This is a good practice and can help in ensuring smooth termination of all containers in the pod.

Here's an example of how you might adjust your operator configuration for debugging:

from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

operator = GKEStartPodOperator(
    task_id="launch_gke_pods",
    project_id=dag_config.get("project_id"),
    location=dag_config.get("region_name"),
    cluster_name=dag_config.get("gke_cluster_name"),
    namespace=dag_config.get("gke_name_space"),
    name=pod_name,
    image=spec["containers"][0]["image"],
    full_pod_spec=deserialized_pod_spec,
    deferrable=True,
    on_finish_action=None,   # Temporarily set to None for debugging
    do_xcom_push=True,
    termination_grace_period=30,  # Set a non-zero grace period
)

Thanks, let me try this and get back to you. Are there any methods to set up the sidecar container configuration?

I tried the above steps but still see the same issue:

on_finish_action="delete_succeeded_pod",  (None is not accepted here; keep_pod, delete_succeeded_pod and delete_pod are the only three values)
do_xcom_push=True,
termination_grace_period=30  # Set a non-zero grace period

 

 

It appears that the issue with the sidecar container not terminating as expected in the GKEStartPodOperator persists even after setting on_finish_action to delete_succeeded_pod. This suggests that the problem might be related to the configuration of the sidecar container itself, rather than to Airflow's handling of pod termination.

To further investigate and resolve this issue, consider the following steps:

  1. Understand Kubernetes Pod Lifecycle: Familiarize yourself with how Kubernetes handles the lifecycle of pods and containers, especially the termination process. Kubernetes sends termination signals (like SIGTERM) to all containers in a pod when it's being terminated. Ensure that your sidecar container is configured to handle these signals appropriately.

  2. Implement Health Checks in the Sidecar Container: Set up proper health checks within the sidecar container. This can involve using liveness and readiness probes to monitor the container's health and responsiveness. Additionally, consider implementing a preStop hook to gracefully handle the shutdown process (see the sketch after this list).

  3. Check Sidecar Container Initialization: Ensure that the sidecar container initializes correctly and does not enter a state that prevents it from responding to termination signals. This might involve reviewing the container's startup scripts or entrypoint configurations.

  4. Monitor Logs for Insights: Examine the logs of the sidecar container during task execution. Look for any error messages or indications of abnormal behavior that could provide clues about why the container is not terminating as expected.

  5. Test in a Controlled Environment: If possible, isolate the sidecar container in a separate test environment. Observe its behavior and termination process independently from the main container. This can help identify if the issue is specific to the container's configuration or its interaction within the pod.

  6. Review Resource and Quota Limits: Verify that the Kubernetes cluster has sufficient resources and that your pod is not hitting any resource or quota limits that might affect container behavior.

  7. Check Version Compatibility: Ensure that there are no compatibility issues between the versions of Kubernetes, Airflow, and the GKEStartPodOperator. Sometimes, bugs or quirks are specific to certain version combinations.
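To make the preStop and SIGTERM points concrete, here is a minimal, hypothetical sketch of a sidecar container defined with the Kubernetes Python client (the same model objects typically used to build a full_pod_spec). The container name, image, and commands below are illustrative placeholders, not taken from your pod spec, and V1LifecycleHandler assumes a recent kubernetes client release:

from kubernetes.client import models as k8s

# Illustrative sidecar: the shell command traps SIGTERM/SIGINT so the container
# exits promptly when Kubernetes terminates the pod, and the preStop hook runs
# a final cleanup step before the termination signal is sent.
sidecar = k8s.V1Container(
    name="my-sidecar",      # hypothetical name
    image="busybox:1.36",   # hypothetical image
    command=["sh", "-c", "trap 'exit 0' TERM INT; while true; do sleep 1; done"],
    lifecycle=k8s.V1Lifecycle(
        pre_stop=k8s.V1LifecycleHandler(
            _exec=k8s.V1ExecAction(command=["sh", "-c", "echo 'sidecar shutting down'"])
        )
    ),
    # Simple liveness probe so Kubernetes restarts the sidecar if it hangs.
    liveness_probe=k8s.V1Probe(
        _exec=k8s.V1ExecAction(command=["sh", "-c", "true"]),
        initial_delay_seconds=5,
        period_seconds=10,
    ),
)

Whether this applies depends on what your sidecar actually does; the key point is that it must exit (or be safely killable) once the main container finishes.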

You can just set do_xcom_push=False to avoid creating the sidecar container.
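For reference, a minimal sketch reusing the argument names from the original post (dag_config, pod_name, spec, and deserialized_pod_spec are assumed to be defined as shown there). With do_xcom_push=False Airflow does not inject the xcom sidecar, so the pod can finish as soon as the base container completes:

from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

launch_gke_pods = GKEStartPodOperator(
    task_id="launch_gke_pods",
    project_id=dag_config.get("project_id"),
    location=dag_config.get("region_name"),
    cluster_name=dag_config.get("gke_cluster_name"),
    namespace=dag_config.get("gke_name_space"),
    name=pod_name,
    image=spec["containers"][0]["image"],
    full_pod_spec=deserialized_pod_spec,
    deferrable=True,
    get_logs=True,
    do_xcom_push=False,   # no xcom sidecar is created, so nothing is left running
    on_finish_action="delete_pod",
)

The trade-off is that the task no longer returns a value via XCom, so any result the container produces has to be handed off some other way (for example, written to storage that a downstream task reads).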