Hi,
I am receiving the following error when trying to run a task using Google Batch:
Job state is set from RUNNING to FAILED for job projects/556499559625/locations/australia-southeast2/jobs/nf-75119cb3-1720952766122. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/australia-southeast2-c/instances/4686181228540204777 due to Batch no longer receives VM updates with exit code 50002.
My logs don't indicate any errors. I have tried a lot of things, such as increasing resources and changing the machine type, but I always run into this error. Would you have any suggestions?
I looked at this thread, which seems to be a similar problem: https://www.googlecloudcommunity.com/gc/Infrastructure-Compute-Storage/Batch-unknown-exit-code/m-p/6...
Hi!
The wording of the error message is confusing. It would be clearer as "Job exited with exit code 50002 - Batch no longer receives updates".
In my experience this happens because the VM is hung and the agent installed on the machine is unable to send messages to report job status. VM hang is most likely because of excessive memory pressure. You can check if the job you were running used up all the memory on the machine. This type of error will be fairly reproducible.
Hi @siddharthab thank you for your response.
I did have a similar error in a different job where I increased the memory (only slightly) and it worked.
What I'm finding odd, however, is that the job that is failing uses about 300 GB of RAM, as tested on a different system. I requested a VM with 2048 GB RAM and it still failed with the same error.
By the way, what would be the easiest way to check if the job used all the memory? I'm using the gcloud batch jobs describe command and it doesn't show that, only the amount of memory I requested.
You can observe the machine as the job is progressing, either through the Observability Tab for the VM running your job, or by SSH-ing into the machine and using something like `top`.
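For example, here is a minimal sketch of the SSH route (the filter pattern, instance name, and zone below are placeholders based on the error message in this thread; substitute your own, which you can find on the job's details page):

```
# List the VMs Batch created for the job (their names usually start with the job's UID)
gcloud compute instances list --filter="name~nf-75119cb3"

# SSH into the VM running the failing task (replace INSTANCE_NAME and the zone)
gcloud compute ssh INSTANCE_NAME --zone=australia-southeast2-c

# On the VM, while the task is running:
free -h                                        # total / used / available RAM and swap
top -o %MEM                                    # processes sorted by memory usage
sudo dmesg -T | grep -iE "out of memory|oom"   # did the kernel OOM killer fire?
```

If memory really is the problem, you will usually see the available column in `free` collapse toward zero, or OOM-killer messages in `dmesg`, shortly before the VM stops reporting to Batch.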
Hi @viniws,
FYI, Batch recently also started to support the `installOpsAgent` field, which installs the Ops Agent on your behalf, and it will be supported by the client libraries (e.g. Python) soon. The Ops Agent can also help you monitor metrics such as CPU or GPU usage.
Hope that helps you better monitor resource metrics in the future.
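In case it is useful before the client library support lands, here is a rough sketch of how this can look in a JSON job config submitted with gcloud. The placement of `installOpsAgent` inside `allocationPolicy.instances[]` is my understanding of the current REST API, and the script, resources, and machine type are just placeholders; please double-check against the Batch docs:

```
# Sketch only: write a job config with the Ops Agent enabled and submit it.
cat > job.json <<'EOF'
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          { "script": { "text": "echo hello from Batch" } }
        ],
        "computeResource": { "cpuMilli": 2000, "memoryMib": 16384 }
      },
      "taskCount": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "installOpsAgent": true,
        "policy": { "machineType": "e2-standard-4" }
      }
    ]
  },
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
EOF

gcloud batch jobs submit my-ops-agent-job \
  --location=australia-southeast2 \
  --config=job.json
```

Once the Ops Agent is running on the VM, memory and process metrics should show up under the VM's Observability tab in addition to the default CPU metrics.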
Thanks,
Wenyan
In our case, the error occurs consistently when multiple mounted files are opened and read. No RAM overload should be expected, as a single line is read from each file and a single line is written to the output file (it is a merging operation). Anyway, I believe this error needs to be revisited by Google; it is very cryptic and there is not enough documentation on how to deal with it.
Hi @vaslem,
Sorry we missed your post. Would you mind adding the topic label "Batch" to your post?
Thanks!
Wenyan
Hi @viniws,
Could you please share more info about the files you used for the job, like the size of the files and how you used them?
When debugging a similar issue, I noticed a high CPU usage peak while processing files. If you could provide the VM metrics, that would also be very helpful; you can find them by clicking into the VM -> OBSERVABILITY -> METRICS -> OVERVIEW.
We are using gcsfuse under the hood for GCS volumes; here is more info about its performance and best practices: https://cloud.google.com/storage/docs/gcsfuse-performance-and-best-practices. Also, you can add the topic label `Cloud Storage` to this post.
To gather more logs and metrics to debug the issue, you can (a rough sketch of steps 1 and 3 follows the list):
1. enable detailed gcsfuse logging using options like `--debug_fuse`: https://cloud.google.com/storage/docs/gcsfuse-cli#options.
2. enable `install_ops_agent` for the Batch job following: https://cloud.google.com/batch/docs/create-run-job-ops-agent
3. configure viewing logs from Cloud Logging: https://cloud.google.com/batch/docs/analyze-job-using-logs
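For example (a rough sketch only; the bucket, project, job name, and UID are placeholders, and the exact `mountOptions` syntax and log/label names are worth verifying against the links above):

```
# Step 1: pass a gcsfuse debug option through the GCS volume's mountOptions in the job config
cat > job.json <<'EOF'
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          { "script": { "text": "ls -l /mnt/disks/share" } }
        ],
        "volumes": [
          {
            "gcs": { "remotePath": "my-bucket/my-prefix" },
            "mountPath": "/mnt/disks/share",
            "mountOptions": ["--debug_fuse"]
          }
        ]
      }
    }
  ],
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
EOF
gcloud batch jobs submit my-debug-job --location=australia-southeast2 --config=job.json

# Step 3: after the job runs, read its task logs from Cloud Logging
# (the job UID is shown by `gcloud batch jobs describe`; agent-side logs live in batch_agent_logs)
gcloud logging read \
  'logName="projects/MY_PROJECT/logs/batch_task_logs" AND labels.job_uid="JOB_UID"' \
  --limit=100
```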
Thanks,
Wen