
Best practice to ensure that Batch Jobs do not stop/stall with ZONE_RESOURCE_POOL_EXHAUSTED errors

I have a Batch job started by a GCP Workflow. I recently received a series of ZONE_RESOURCE_POOL_EXHAUSTED errors, and the workflow eventually timed out after 30 minutes. The current location is "us-central1", and I assume that all zones in that region were checked for resources. I was hoping I could specify other locations with allowedLocations, but I see that having multiple regions is not permitted.

call: googleapis.batch.v1.projects.locations.jobs.create
args:
  parent: ${"projects/" + project + "/locations/" + location}

What is the best practice for avoiding or recovering from these errors? I see two possibilities.

1. I could catch the timeout error from the create call

"{"message":"Timeout of 1800 seconds exceeded. The timeout occurred during operation status polling.","tags":["TimeoutError","OSError"]}"

Then I could switch the region to us-east1 and/or change the machine type and try again, but that means I am already 30 minutes in the hole. I guess I could lower the timeout value...

2. I could set up a reserved instance for the machine type and region for 730 hours a month, but then how do I handle concurrent requests?

Can you point me to any resources? (I have looked, BTW.)

ACCEPTED SOLUTION

If your job is not limited to particular zones, the Batch service does look for VMs in all available zones. Both approaches you mentioned are good workarounds. For #2, after you create a reservation, multiple Batch jobs should be able to use it concurrently unless all VMs are occupied by previous jobs.

Relatively soon, we'll provide a feature that lets you specify more than one machine type for a Batch job to mitigate availability issues.
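
For the first workaround from the question (catch the timeout, then retry in another region or with another machine type), the control flow might look roughly like the Python sketch below. It is only an illustration of the fallback order, not Batch or Workflows API code: `submit`, `PoolExhaustedTimeout`, and `submit_with_fallback` are hypothetical names, and the sketch assumes the caller's `submit` function creates the job, waits up to `timeout_s` seconds, and cleans up any job that is still queued before raising. In a Workflow, the same shape would be a try/except around the create call inside a loop over candidate locations.

```python
from typing import Callable, Optional, Sequence, Tuple


class PoolExhaustedTimeout(Exception):
    """Hypothetical error: no capacity was obtained before the deadline."""


def submit_with_fallback(
    submit: Callable[[str, str, int], str],
    candidates: Sequence[Tuple[str, str]],
    timeout_s: int = 300,
) -> str:
    """Try each (region, machine_type) pair in order.

    `submit` is supplied by the caller (an assumption of this sketch); it
    should return a job identifier or raise PoolExhaustedTimeout.
    A short timeout_s keeps the cost of each failed attempt low.
    """
    last_error: Optional[Exception] = None
    for region, machine_type in candidates:
        try:
            return submit(region, machine_type, timeout_s)
        except PoolExhaustedTimeout as err:
            # Remember the failure and move on to the next candidate.
            last_error = err
    raise RuntimeError("no capacity in any candidate location") from last_error
```

Choosing a timeout well below 30 minutes per attempt addresses the "30 minutes in the hole" concern, at the cost of sometimes giving up on a region that would have freed capacity shortly afterwards.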
