Solved: Best practice to ensure that Batch Jobs do not stop/stall with ZONE_RESOURCE_POOL_EXHAUSTED errors

ChrisMICDUP · 07-16-2024 06:03 PM

I have a Batch job started by a GCP Workflow. I recently received a series of ZONE_RESOURCE_POOL_EXHAUSTED errors, and the workflow eventually timed out after 30 minutes. The current location is "us-central1", and I assume that all zones in that region were checked for resources. I was hoping I could specify other locations with allowedLocations, but I see that specifying multiple regions is not permitted.

call: googleapis.batch.v1.projects.locations.jobs.create
args:
  parent: ${"projects/" + project + "/locations/" + location}

What is the best practice for avoiding or recovering from these errors? I see two possibilities.

1. I could catch the timeout error from the create call

"{"message":"Timeout of 1800 seconds exceeded. The timeout occurred during operation status polling.","tags":["TimeoutError","OSError"]}"

I could then switch to another region such as us-east1, and/or change the machine type, and try again, but by that point I am already 30 minutes in the hole. I guess I could lower the timeout value...

2. I could set up a reserved instance for the machine type and region for 730 hours a month, but then how do I handle concurrent requests?
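
A minimal sketch of option 1, assuming the polling timeout is lowered through the connector's `connector_params.timeout` and that a timeout is detected by the "TimeoutError" tag shown in the error above. The region list, the 600-second value, and the `input.jobBody` parameter are illustrative assumptions, not settings from my actual workflow:

```yaml
main:
  params: [input]
  steps:
    - init:
        assign:
          - project: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          # Hypothetical fallback order; pick regions that suit the job.
          - regions: ["us-central1", "us-east1"]
          - createResult: null
    - tryRegions:
        for:
          value: location
          in: ${regions}
          steps:
            - createJob:
                try:
                  call: googleapis.batch.v1.projects.locations.jobs.create
                  args:
                    parent: ${"projects/" + project + "/locations/" + location}
                    body: ${input.jobBody}  # assumed to hold the Batch job definition
                    connector_params:
                      timeout: 600  # seconds; assumed value, down from the 1800 in the error
                  result: createResult
                except:
                  as: e
                  steps:
                    - checkTimeout:
                        switch:
                          # On a polling timeout, move on to the next region.
                          - condition: ${"TimeoutError" in e.tags}
                            next: continue
                    - rethrow:
                        # Any other error is not handled here.
                        raise: ${e}
            - stopLooping:
                next: break
    - checkResult:
        switch:
          - condition: ${createResult == null}
            raise: "Job creation timed out in every region"
    - finish:
        return: ${createResult}
```

Two caveats with this sketch: the lower timeout also cuts off polling for jobs that are legitimately still running, and a job whose polling timed out may still be queued in the first region, so it may need to be deleted before or after the retry.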

Can you point me to any resources? (I have looked, BTW.)

bolianyin

If your job is not limited to particular zones, Batch service does look for VMs in all available zones. Both approaches you mentioned are good workarounds. For #2, after you create a reservation, multiple Batch jobs should be able to use it concurrently unless all VMs are occupied by previous jobs.
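
As a rough sketch of workaround #2, a job can be pointed at a specific reservation through the `reservation` field of the instance policy. The reservation name, machine type, zone, and the placeholder script below are hypothetical; this assumes the reservation already exists in that zone with a matching machine type:

```yaml
- createJob:
    call: googleapis.batch.v1.projects.locations.jobs.create
    args:
      parent: ${"projects/" + project + "/locations/" + location}
      body:
        taskGroups:
          - taskSpec:
              runnables:
                - script:
                    text: "echo hello"  # placeholder workload
        allocationPolicy:
          location:
            # Reservations are zonal, so the job is pinned to the reservation's zone.
            allowedLocations:
              - "zones/us-central1-a"
          instances:
            - policy:
                machineType: "n2-standard-4"         # must match the reservation
                reservation: "my-batch-reservation"  # hypothetical reservation name
```

Several jobs submitted with the same policy would draw from the same reservation until its VMs are all in use.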

Relatively soon, we'll provide a feature that lets you specify more than one machine type for a Batch job to mitigate availability issues.

