
Google Batch job VMs are terminated after 7d

We are using Google Batch to run long-running jobs. According to the documentation, jobs may run for up to 14d, but the VMs running our jobs are terminated after 7d, and the Batch job then fails with "Batch no longer receives VM updates with exit code 50002". We have not set `maxRunDuration` in our task spec.

The logs show that a `compute.instances.deferredDelete` was issued, along with this message:

Instance Group Manager 'ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0' initiated recreateInstance on instance 'projects/566203377590/zones/us-central1-b/instances/ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0-9945'. Reason: Instance eligible for repair: Instance passed it's termination timestamp; termination_timestamp=2024-12-26T20:41:28.68375-08:00; current_time=2024-12-26T20:41:30.643952-08:00; current_status=STOPPING, target_status=STATUS_RUNNING.

Is there a way to prevent the VM from being killed after 7d?

In case examples are helpful, these jobs were all killed in the same way after exactly 7d:

  • Batch job: projects/drailab/locations/us-central1/jobs/ruisu-surf-2hcg9-1732734838
    VM instance: projects/drailab/zones/us-central1-b/instances/ruisu-surf-2hcg9-1-8a08b338-9b84-40c30-group0-0-99r5
  • Batch job: projects/drailab/locations/us-central1/jobs/ruisu-surf-94afp-1734797267
    VM instance: projects/drailab/zones/us-central1-b/instances/ruisu-surf-94afp-1-0621dbf0-a10a-4d380-group0-0-r5lz
  • Batch job: projects/drailab/locations/us-central1/jobs/ruisu-surf-vx3sg-1734669578
    VM instance: projects/drailab/zones/us-central1-b/instances/ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0-9945
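
As a rough check of the 7d figure, the sketch below compares the `termination_timestamp` from the log message above with the submission time of the matching job. It assumes that the numeric suffix of the job name (1734669578) is the Unix timestamp at which the job was submitted; that is only a guess based on the naming pattern, and the helper names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Excerpt of the Instance Group Manager message quoted above (copied verbatim).
LOG_MESSAGE = (
    "Reason: Instance eligible for repair: Instance passed it's termination "
    "timestamp; termination_timestamp=2024-12-26T20:41:28.68375-08:00; "
    "current_time=2024-12-26T20:41:30.643952-08:00; "
    "current_status=STOPPING, target_status=STATUS_RUNNING."
)
JOB_NAME = "projects/drailab/locations/us-central1/jobs/ruisu-surf-vx3sg-1734669578"


def termination_time(message):
    """Extract termination_timestamp from the log message as an aware datetime."""
    match = re.search(r"termination_timestamp=([^;]+);", message)
    if match is None:
        raise ValueError("no termination_timestamp in message")
    # strptime is used because %f accepts the 5-digit fraction in this log line.
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")


def assumed_submit_time(job_name):
    """ASSUMPTION: the trailing number in the job name is a Unix epoch in seconds."""
    epoch = int(job_name.rsplit("-", 1)[1])
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


elapsed = termination_time(LOG_MESSAGE) - assumed_submit_time(JOB_NAME)
print(f"Time from assumed submission to VM termination: {elapsed}")
```

Under that assumption, the gap comes out at 7 days plus about two minutes, which suggests the termination timestamp was fixed when the VM was provisioned rather than being the result of a failure during the run.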
1 ACCEPTED SOLUTION

Did the jobs run on Dynamic Workload Scheduler with Batch (https://cloud.google.com/batch/docs/create-run-job-gpus#select-provisioning-method)? Jobs that use Dynamic Workload Scheduler have a 7-day limit.

They did! I forgot that DWS jobs had a 7-day limit. Thank you for your prompt support!