
Example to show batch job retry in GCP Workflow

Hi,

I have a Workflow that creates a Batch job whose task runs a Docker image to do image processing and then exits. I use Spot VMs. All of this works fine.

Now I would like to implement and test a simple retry mechanism if the VM is pre-empted and warn DevOps if the task fails.

I need an example of the Workflow task retry syntax, a way to manually preempt the VM (the gcloud compute instances stop command requires the VM ID), and a way to query the exit code in a subsequent Workflow step.

Workflow snippet (deploys but not tested):

- create_transcoding_job:
    call: googleapis.batch.v1.projects.locations.jobs.create
    args:
      parent: ${"projects/" + project + "/locations/" + location}
      jobId: "${jobId}"
      body:
        priority: 99
        taskGroups:
          - taskCount: 1
            parallelism: 1
            taskSpec:
              computeResource:
                ...
              maxRetryCount: 3
              lifecyclePolicies:
                # If the VM is preempted (exit code 50001), retry up to 3 times
                - action: RETRY_TASK
                  actionCondition:
                    exitCodes: [ 50001 ]
        allocationPolicy:
          instances:
            - policy:
                provisioningModel: SPOT
                machineType: "${machineType}"
        ...

If the retries are exhausted, I'd like to notify DevOps.

 

 

1 ACCEPTED SOLUTION

Hello,

1. The retry syntax looks good to me; 50001 is the right code for preemption.

2. To test preemption, you can simulate a maintenance event on the VM. What I did previously was:

gcloud compute instances set-scheduling VM_NAME --maintenance-policy=TERMINATE --zone ZONE

gcloud compute instances simulate-maintenance-event VM_NAME --zone ZONE

By the way, you can find your VM names in the Cloud Console. They should be prefixed with the job ID.

3. One thing to mention is that you cannot query the exit code directly at the moment. The exit code is only available in the task event description, so for now, when a task fails, you have to parse the description message. For example, given a description like "task is failed due to Spot preemption with code 50001", you need to extract the code from the message.
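
As a rough sketch of point 2, the two commands can be wrapped in a small shell function that first looks up the job's VMs by name prefix. The function name simulate_preemption is hypothetical, and the filter assumes that VM names really do start with the job ID, so check the actual names in the Cloud Console before relying on it.

```shell
# Hypothetical helper: simulate a maintenance event on every VM whose name
# starts with the given Batch job ID (assumed naming; verify in the Console).
simulate_preemption() {
  job_id="$1"
  # List matching VMs as "name<TAB>zone" pairs, one per line.
  gcloud compute instances list \
    --filter="name~^${job_id}" \
    --format="value(name,zone.basename())" |
  while read -r vm zone; do
    [ -n "$vm" ] || continue
    # The same two commands as in point 2. stdin is detached so gcloud
    # cannot consume the remaining lines of the VM list.
    gcloud compute instances set-scheduling "$vm" \
      --maintenance-policy=TERMINATE --zone "$zone" < /dev/null
    gcloud compute instances simulate-maintenance-event "$vm" \
      --zone "$zone" < /dev/null
  done
}
```

Run it as simulate_preemption JOB_ID while the job is running, then watch whether the task is retried.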
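
For point 3, here is a minimal sketch of the parsing step in shell. It assumes the description string has already been retrieved and that it follows the "... with code NNNNN" wording from the example above; real messages may be worded differently. The function name extract_exit_code is illustrative only.

```shell
# Hypothetical helper: print the number that follows the last "code " in a
# task event description, or nothing if the message has no such number.
extract_exit_code() {
  printf '%s\n' "$1" | sed -n 's/.*code \([0-9][0-9]*\).*/\1/p'
}

# Example with the message format quoted in point 3:
#   extract_exit_code "task is failed due to Spot preemption with code 50001"
#   prints 50001
```

An empty result means no code was found, which a caller could treat as "unknown failure" and escalate to DevOps rather than assume preemption.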
