
Batch - unknown exit code

Hi all,

I submitted a Batch job, and it fails on its own while it is running. The job's events report an unknown error. What could be causing this?

Details:

Job state is set from RUNNING to FAILED for job projects/647012610224/locations/us-central1/jobs/ifr-etl-218235706939801633. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-central1-a/instances/952685742931573065 with error Batch no longer receives VM updates. with unknown exit code

 



Hi JonYu, thanks for trying Batch!

From the information you provided, your tasks failed because Batch stopped receiving updates from the VM for some reason. Since you enabled the CLOUD_LOGGING logs policy for your job, could you try troubleshooting with logs by following https://cloud.google.com/batch/docs/analyze-job-using-logs? That may show whether something happened while your job was running that made the VM stop responding for a period.

Also, could you try setting the `maxRetryCount` field, as described in https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#taskspec, to see whether retries help in your case?
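A minimal sketch of where `maxRetryCount` sits in a job definition, written as a Python dict that could be dumped to JSON for the REST API or for `gcloud batch jobs submit --config`. The script text, task count, and machine type are hypothetical placeholders; the Spot provisioning model and the CLOUD_LOGGING logs policy mirror the setup described in this thread.

```python
import json


def build_job(max_retry_count: int = 3) -> dict:
    """Return a Batch job body (v1 REST field names) with task retries enabled.

    The script text and machine type are hypothetical placeholders.
    """
    # The TaskSpec reference documents a valid range of 0-10 (0 = never retry).
    if not 0 <= max_retry_count <= 10:
        raise ValueError("maxRetryCount must be between 0 and 10")
    return {
        "taskGroups": [
            {
                "taskCount": 1,
                "taskSpec": {
                    "runnables": [
                        {"script": {"text": "echo running task ${BATCH_TASK_INDEX}"}}
                    ],
                    # Retry a failed task up to this many times before it is marked FAILED.
                    "maxRetryCount": max_retry_count,
                },
            }
        ],
        "allocationPolicy": {
            # Spot VMs can be preempted, which is what failed the job in this thread.
            "instances": [
                {"policy": {"machineType": "e2-standard-4", "provisioningModel": "SPOT"}}
            ]
        },
        # Send task logs to Cloud Logging.
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


if __name__ == "__main__":
    print(json.dumps(build_job(), indent=2))
```

Saved as `job.json`, the output could be submitted with `gcloud batch jobs submit JOB_NAME --location=us-central1 --config=job.json`, where `JOB_NAME` is a placeholder.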

Hope the above helps, thanks!

Hi @wenyhu,

I found the cause of the problem: my instance is a Spot VM, and it was preempted. However, I did not find the relevant log in GCP Logging; I only found it through the instance log. Can this be improved?

 

Hi JonYu,

If Spot VM preemption caused your job failure, I would expect you to find some preemption-related logs in the Google Cloud console by clicking Logs -> LOGGING -> batch_agent_logs.
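As a sketch, the same logs could also be queried outside the console with a Cloud Logging filter on the `batch_agent_logs` log. The helper below only builds the filter string. The project ID is hypothetical, and the keyword `preempt` is an assumption about how preemption messages are worded, so pass `None` to see every agent log entry.

```python
from datetime import datetime, timezone
from typing import Optional


def batch_agent_log_filter(
    project_id: str, start: datetime, keyword: Optional[str] = "preempt"
) -> str:
    """Build a Logging query language filter for a project's batch_agent_logs."""
    start_utc = start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        f'logName="projects/{project_id}/logs/batch_agent_logs"',
        f'timestamp>="{start_utc}"',
    ]
    if keyword:
        # A bare quoted string is a global restriction: it searches across fields.
        # The keyword is an assumed wording, not a documented log message.
        lines.append(f'"{keyword}"')
    # Expressions on separate lines are implicitly joined with AND.
    return "\n".join(lines)


if __name__ == "__main__":
    print(batch_agent_log_filter("my-project", datetime(2023, 7, 13, tzinfo=timezone.utc)))
```

The printed filter can be pasted into Logs Explorer or passed to `gcloud logging read`, for example `gcloud logging read "$FILTER" --project=my-project --limit=50`.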


Would you mind sharing the logs you got for your job? That would also help the Batch team check whether the logs meet our expectations.

Thanks!

Hi @wenyhu,

I filtered out some of my business logs, and going through the GCP logs, I did not find the reason why my job exited.


Reason: "Instance eligible for autohealing: instance should be RUNNING, but is STOPPING."

Hi @JonYu,

Hope the original issue did not block you.

We have improved the troubleshooting documentation for Batch reserved exit codes at https://cloud.google.com/batch/docs/troubleshooting#reserved-exit-codes, including VM preemption. Hopefully it will help you triage next time.
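For example, reserved exit codes make it possible to tell a Spot preemption apart from a task error and to retry only the former. The sketch below assumes the codes 50001 (VM preemption) and 50002 (VM reporting timeout) as listed on that page; check the page for the full, current list. The retry policy uses the TaskSpec `lifecyclePolicies` field with the `RETRY_TASK` action, and the script text is a placeholder.

```python
# Subset of Batch reserved exit codes (assumed from the troubleshooting page; verify there).
RESERVED_EXIT_CODES = {
    50001: "VM preemption",
    50002: "VM reporting timeout (Batch no longer receives VM updates)",
}


def describe_exit_code(code: int) -> str:
    """Map an exit code to a description, if it is in this sketch's table."""
    return RESERVED_EXIT_CODES.get(
        code, "not in this sketch's table; check the troubleshooting page"
    )


def retry_on_preemption_spec(script_text: str, max_retry_count: int = 3) -> dict:
    """Build a TaskSpec that retries only tasks that fail due to Spot preemption."""
    return {
        "runnables": [{"script": {"text": script_text}}],
        # Upper bound on retries for any single task.
        "maxRetryCount": max_retry_count,
        "lifecyclePolicies": [
            {
                # RETRY_TASK: retry when the exit code matches, fail the task otherwise.
                "action": "RETRY_TASK",
                "actionCondition": {"exitCodes": [50001]},
            }
        ],
    }


if __name__ == "__main__":
    print(describe_exit_code(50001))
    print(retry_on_preemption_spec("echo hello"))
```

Pairing the lifecycle policy with `maxRetryCount` keeps genuine task errors from being retried, whereas `maxRetryCount` alone would retry every failure.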

Thanks!

Thanks a lot for adding this documentation!