Hi all,
I submitted a Batch job, and it fails automatically while it is running. The job events only show an unknown error. What is the reason for this?
Details:
Job state is set from RUNNING to FAILED for job projects/647012610224/locations/us-central1/jobs/ifr-etl-218235706939801633. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-central1-a/instances/952685742931573065 with error Batch no longer receives VM updates. with unknown exit code
Hi JonYu, thanks for trying Batch!
From the information you provided, your task failed because Batch stopped receiving updates from the VM for some reason. Since you enabled the CLOUD_LOGGING logs policy for your job, could you try troubleshooting with logs, following https://cloud.google.com/batch/docs/analyze-job-using-logs, to see whether anything happened while the job was running that caused the VM to stop responding for a period?
Also, could you try the `maxRetryCount` field, described at https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#taskspec, to see whether retrying helps in your case?
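As a rough illustration of the retry suggestion, here is a minimal Python sketch that builds a `taskSpec` fragment with `maxRetryCount` set. The helper name `build_task_spec` and the sample script are hypothetical. The `lifecyclePolicies` entry and the exit code 50001 (which the reserved exit codes documentation lists for VM preemption) are assumptions that should be checked against the current Batch reference before use.

```python
import json

# Assumption: 50001 is the reserved Batch exit code for VM preemption.
# Verify this against the Batch troubleshooting documentation.
VM_PREEMPTION_EXIT_CODE = 50001


def build_task_spec(script_text, max_retry_count=3):
    """Return a taskSpec-style dict (hypothetical helper, not a Batch API)."""
    # The Batch reference documents maxRetryCount as a value from 0 to 10.
    if not 0 <= max_retry_count <= 10:
        raise ValueError("max_retry_count must be between 0 and 10")
    return {
        "runnables": [{"script": {"text": script_text}}],
        "maxRetryCount": max_retry_count,
        # Assumed policy: retry the task when it ends with the preemption code.
        "lifecyclePolicies": [
            {
                "action": "RETRY_TASK",
                "actionCondition": {"exitCodes": [VM_PREEMPTION_EXIT_CODE]},
            }
        ],
    }


if __name__ == "__main__":
    # Print the fragment so it can be pasted into a job JSON file.
    print(json.dumps(build_task_spec("echo hello"), indent=2))
```

The printed fragment would sit under `taskGroups[].taskSpec` in the job definition. Retries only help if the failure is transient, as a Spot VM preemption usually is.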
Hope the above helps, thanks!
Hi, @wenyhu
I found the cause of the problem: my instance is of the SPOT type and was preempted. However, I did not find the relevant log in the GCP logger; I only found it through the instance log. Can this be improved?
Hi JonYu,
If Spot instance preemption is the cause of your job failure, I would expect you to find some preemption-related logs when you click Logs -> LOGGING -> batch_agent_logs in your Pantheon UI.
Would you mind sharing the logs you got for your job, so that Batch can also check whether the logs meet our expectations?
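In case it helps narrow the search, below is a small Python sketch that only builds Cloud Logging filter strings; it does not call any API. The helper names and the placeholder values `my-project` and `my-job-uid` are hypothetical. The `labels.job_uid` label and the `compute.instances.preempted` method name are taken from the Batch and Compute Engine documentation as I understand it, so please treat them as assumptions to verify.

```python
def batch_agent_log_filter(project_id, job_uid):
    """Build a filter for the Batch agent logs of one job (assumed label name)."""
    return (
        f'logName="projects/{project_id}/logs/batch_agent_logs" '
        f'AND labels.job_uid="{job_uid}"'
    )


def preemption_event_filter(instance_id):
    """Build a filter for Compute Engine preemption events on one VM."""
    return (
        'resource.type="gce_instance" '
        f'AND resource.labels.instance_id="{instance_id}" '
        'AND protoPayload.methodName="compute.instances.preempted"'
    )


if __name__ == "__main__":
    # Placeholder project and job UID; the instance ID is the one in the error above.
    print(batch_agent_log_filter("my-project", "my-job-uid"))
    print(preemption_event_filter("952685742931573065"))
```

Either string can be pasted into the Logs Explorer query box. If the second filter returns an entry around the failure time, the VM was most likely preempted.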
Thanks!
Hi, @wenyhu
I filtered out some business logs. Going through the GCP logs, I did not find the reason why my job exited.
Reason: Instance eligible for autohealing: instance should be RUNNING, but is STOPPING."
Hi @JonYu,
I hope the original issue did not block you.
We have improved the troubleshooting documentation for Batch-related exit codes at https://cloud.google.com/batch/docs/troubleshooting#reserved-exit-codes, including VM preemption. I hope it helps you triage next time.
Thanks!
Thanks a lot for adding this documentation!