We are using Google Batch to run long-running jobs. According to the documentation, jobs may run for up to 14 days, but we see that the VMs running our jobs are terminated after 7 days, causing the Google Batch job to fail with "Batch no longer receives VM updates" (exit code 50002). We have not set any `maxRunDuration` in our task spec.
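For context, a minimal sketch of where `maxRunDuration` would live in a Batch job request if we had set it (the script text and duration value here are placeholders, not our actual config):

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          { "script": { "text": "./run-long-job.sh" } }
        ],
        "maxRunDuration": "1209600s"
      }
    }
  ]
}
```

Since we leave `maxRunDuration` unset, we expected the documented 14-day default limit to apply, yet the VMs are reclaimed at 7 days.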
The logs show that a `compute.instances.deferredDelete` was issued, along with this message (quoted verbatim):

```
Instance Group Manager 'ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0' initiated recreateInstance on instance 'projects/566203377590/zones/us-central1-b/instances/ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0-9945'. Reason: Instance eligible for repair: Instance passed it's termination timestamp; termination_timestamp=2024-12-26T20:41:28.68375-08:00; current_time=2024-12-26T20:41:30.643952-08:00; current_status=STOPPING, target_status=STATUS_RUNNING.
```
Is there a way to prevent the VM from being killed after 7 days?
In case it is a useful data point: these jobs were all killed in the same way after exactly 7 days.
Did the jobs run on Dynamic Workload Scheduler with Batch (https://cloud.google.com/batch/docs/create-run-job-gpus#select-provisioning-method)? Jobs that use Dynamic Workload Scheduler have a 7-day run limit.
They did! I forgot that DWS jobs have a 7-day limit. Thank you for your prompt support!