We are using Google Batch to run long-running jobs. According to the documentation, jobs may run for up to 14 days, but we see that the VMs running our jobs are terminated after 7 days, causing the Google Batch job to fail with "Batch no longer receives VM updates" (exit code 50002). We have not set any `maxRunDuration` in our task spec.
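For context, a minimal sketch of where `maxRunDuration` would live in a Batch job request if we had set it (the script text and duration value here are placeholders, not our actual config):

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          { "script": { "text": "./run-long-job.sh" } }
        ],
        "maxRunDuration": "1209600s"
      }
    }
  ]
}
```

Since we leave `maxRunDuration` unset, we expected the documented 14-day default limit to apply, yet the VMs are reclaimed at 7 days.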
The logs show that a `compute.instances.deferredDelete` was issued, along with this message (quoted verbatim):

```
Instance Group Manager 'ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0' initiated recreateInstance on instance 'projects/566203377590/zones/us-central1-b/instances/ruisu-surf-vx3sg-1-8f7a5c2c-f9d7-4bca0-group0-0-9945'. Reason: Instance eligible for repair: Instance passed it's termination timestamp; termination_timestamp=2024-12-26T20:41:28.68375-08:00; current_time=2024-12-26T20:41:30.643952-08:00; current_status=STOPPING, target_status=STATUS_RUNNING.
```
Is there a way to prevent the VM from being killed after 7 days?
In case it is a useful data point: these jobs were all killed in the same way after exactly 7 days.
Did the jobs run on Dynamic Workload Scheduler with Batch (https://cloud.google.com/batch/docs/create-run-job-gpus#select-provisioning-method)? Jobs that use Dynamic Workload Scheduler have a 7-day run limit.
They did! I forgot that DWS jobs have a 7-day limit. Thank you for your prompt support!