How can I figure out why a task failed on Google batch?

 

I have 4 failed tasks. Using the command:

gcloud beta batch tasks list --job [JOB_NAME] --project [PROJECT_NAME] --location [LOCATION] --page-size 500

I can get a list of the tasks. Then, to filter for the logs of a specific task:

logName="projects/[PROJECT_NAME]/logs/batch_task_logs" labels.task_id="[TASK_ID]" timestamp>="[TIMESTAMP]" severity>=DEFAULT

However, while this gives me the stdout/stderr of a specific task, it doesn't tell me why the task failed. I'm guessing the task failed for a reason such as running out of memory, or (since the tasks are running on spot instances) the host instance being killed. I'm not sure, but the process's stderr/stdout alone isn't sufficient. How can I view the complete logs for the task, so I can figure out why it failed?


Hi @vedantroy-genmo,

Batch also records logs in `batch_agent_logs`. Can you try getting more log information from there as well?

Ref: https://cloud.google.com/batch/docs/analyze-job-using-logs.

Another way to collect the entire logs is to go to Cloud Logging and search for the information you want.
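As a rough sketch of the suggestion above, both Batch logs can be queried for the same job with `gcloud logging read`. The helper name `batch_log_filter` and the `MY_PROJECT` / `MY_JOB_UID` placeholders are made up for illustration, and filtering on `labels.job_uid` is an assumption about how Batch labels its entries, so check it against a real log entry first.

```shell
# Build a Cloud Logging filter for one Batch log of one job.
# Usage: batch_log_filter PROJECT LOG_ID JOB_UID
# Assumption: Batch log entries carry a labels.job_uid label.
batch_log_filter() {
  printf 'logName="projects/%s/logs/%s" AND labels.job_uid="%s"' "$1" "$2" "$3"
}

# Example (needs gcloud and credentials, so it is left commented out):
# for log in batch_task_logs batch_agent_logs; do
#   gcloud logging read "$(batch_log_filter MY_PROJECT "$log" MY_JOB_UID)" \
#     --project MY_PROJECT --limit 200
# done
```

Reading the task log and the agent log for the same time window makes it easier to line up a task's exit with what the agent was doing on the VM.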

Thanks,

Wenyan

Ok. If I search "Task task/<my task name>" and "exited with status", I can see tasks that failed. The problem is, the error code is opaque.

Task task/mytask-group0-281/0/0 runnable 0 exited with status 125
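For many failed tasks, lines like the one above can be reduced to task/status pairs. This is a minimal sketch that assumes every relevant log line has exactly the shape shown above; `exit_statuses` is a hypothetical helper name.

```shell
# Print "<task> <exit status>" for each "exited with status" line on stdin.
# Lines that do not match the expected shape are skipped.
exit_statuses() {
  sed -n 's|.*Task task/\([^ ]*\) runnable [0-9]* exited with status \([0-9]*\).*|\1 \2|p'
}

# Example:
# printf '%s\n' 'Task task/mytask-group0-281/0/0 runnable 0 exited with status 125' | exit_statuses
# prints: mytask-group0-281/0/0 125
```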

 

Do you have any thoughts on what might be happening?

Hi @vedantroy-genmo,

Exit code 125 usually means that your container runnable failed at the docker command itself, i.e. "container failed to run". There can be multiple causes. For example, if you are using GPUs, the GPU driver installation might not have succeeded. Or the container image you are using for your task may have an issue, or your command may contain an error. The `batch_task_logs` log gives you task-related details, and the `batch_agent_logs` log gives more details about the package installation that Batch requires. I would recommend combining these two types of logs in your investigation.
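To make the distinction concrete, here is a small lookup of the exit codes that `docker run` reserves for its own failures, based on the Docker run reference as I understand it. The function name `explain_exit_code` is hypothetical, and the 50001 entry only reflects what is said about Batch in this thread.

```shell
# Rough interpretation of an exit status reported for a container runnable.
explain_exit_code() {
  case "$1" in
    0)     echo "success" ;;
    125)   echo "docker run itself failed; the container did not run" ;;
    126)   echo "the container command was found but could not be invoked" ;;
    127)   echo "the container command was not found" ;;
    50001) echo "reported by Batch for Spot preemption (per this thread)" ;;
    *)     echo "exit code of the command inside the container" ;;
  esac
}

# Example:
# explain_exit_code 125
```

Anything other than these codes normally comes from the program inside the container, in which case `batch_task_logs` is the place to look.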

In the meantime, you can also share your logs, job UID, and region with the Batch team if you would like us to help investigate.

Thanks!

Wenyan

Sure. I'm running the job in us-central1 (not sure if the zone is us-central1-a, or multiple zones), and the UID is "internvid-md5-v1-a9339995-3599-48f6-b0".

One issue is, I think the 125 is because of a spot pre-emption notice, but from reading the docs, batch tasks should exit with 50001 following spot pre-emption. For now, I can configure the VMs to retry on exit code 125, but I wonder if I'm hiding a deeper bug.
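For reference, a retry on specific exit codes might look like the `taskSpec` fragment below. This is a sketch, not a complete job config: the field names (`maxRetryCount`, `lifecyclePolicies`, `action`, `actionCondition.exitCodes`) are as I understand the Batch job JSON and should be checked against the API reference, and `retry_policy_fragment` is a hypothetical helper that just prints the fragment.

```shell
# Print a taskSpec fragment that retries tasks exiting with 125 or 50001.
# Assumption: these field names match the Batch job JSON schema.
retry_policy_fragment() {
  cat <<'EOF'
{
  "maxRetryCount": 3,
  "lifecyclePolicies": [
    {
      "action": "RETRY_TASK",
      "actionCondition": { "exitCodes": [125, 50001] }
    }
  ]
}
EOF
}

# Example: retry_policy_fragment > retry-fragment.json
```

Retrying on 125 also retries genuine container start failures, so a small retry count keeps a real bug from being masked indefinitely.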

These are the log results from filtering on a hostname: you can see there's a spot pre-emption notice, and then the runnable exits with code 125.

[Screenshot: log entries filtered by hostname, 2024-04-02]

 

Hi @vedantroy-genmo,

Yes, you found the right information. With the job and task information you provided, I confirmed that your task failed due to spot preemption.

If you make a Get Task API call for the task `internvid-md5-v1-a9339995-3599-48f6-b0-group0-305` shown in the screenshot, you should be able to see the task's status event with error code 50001.

 
In gcloud, the equivalent is `gcloud batch tasks describe {YOUR_FULL_TASK_NAME}`.
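A sketch of how the full task name might be assembled, assuming the resource name follows the `projects/.../locations/.../jobs/.../taskGroups/.../tasks/...` pattern. `batch_task_name`, `MY_PROJECT`, and `MY_JOB` are placeholders for illustration, and the `--format` projection assumes the task's status events appear under `status.statusEvents`.

```shell
# Build a full Batch task resource name.
# Usage: batch_task_name PROJECT LOCATION JOB TASK_GROUP TASK_INDEX
batch_task_name() {
  printf 'projects/%s/locations/%s/jobs/%s/taskGroups/%s/tasks/%s' "$1" "$2" "$3" "$4" "$5"
}

# Example (needs gcloud and credentials, so it is left commented out):
# gcloud batch tasks describe "$(batch_task_name MY_PROJECT us-central1 MY_JOB group0 305)" \
#   --format 'yaml(status.state, status.statusEvents)'
```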
 
Thanks,
Wenyan