
Google Batch job fails randomly when using enable_image_streaming

I'm currently running Google Batch with enable_image_streaming set to False, and I've been trying to enable the container image streaming feature to reduce the time between creating a job and my application's code actually running.

When I change enable_image_streaming to True, some of the jobs fail on launch (seemingly at random). The jobs all pull the same container image from the same Artifact Registry repository in the same region.
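For context, here is a minimal sketch of how I'm creating these jobs with the Python client; the project, repo, image path, and the create_job() helper name are placeholders, not my real values, and if your client library version is older the enable_image_streaming field may not be present. The only thing I toggle between runs is the streaming flag.

```python
from google.cloud import batch_v1


def create_job(project_id: str, region: str, job_id: str, enable_streaming: bool) -> batch_v1.Job:
    """Create a single-task Batch job; only enable_image_streaming is toggled between runs."""
    client = batch_v1.BatchServiceClient()

    container = batch_v1.Runnable.Container(
        # Placeholder image path -- the real one points at Artifact Registry in the same region.
        image_uri=f"{region}-docker.pkg.dev/{project_id}/my-repo/my-app:latest",
        enable_image_streaming=enable_streaming,  # the flag being toggled between False and True
    )
    runnable = batch_v1.Runnable(container=container)

    task_group = batch_v1.TaskGroup(
        task_count=1,  # one task per job, so one VM per job
        task_spec=batch_v1.TaskSpec(runnables=[runnable], max_retry_count=0),
    )

    job = batch_v1.Job(
        task_groups=[task_group],
        logs_policy=batch_v1.LogsPolicy(
            destination=batch_v1.LogsPolicy.Destination.CLOUD_LOGGING
        ),
    )

    return client.create_job(
        batch_v1.CreateJobRequest(
            parent=f"projects/{project_id}/locations/{region}",
            job_id=job_id,
            job=job,
        )
    )
```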

The failing jobs fail on launch with exit code 202, with the batch agent printing:

```
E0821 22:09:32.546640 1213 remote_runtime.go:222] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
time="2024-08-21T22:09:32Z" level=fatal msg="run pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0821 22:09:38.578247 1220 remote_runtime.go:222] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0821 22:09:44.605809 1227 remote_runtime.go:222] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
time="2024-08-21T22:09:44Z" level=fatal msg="run pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Batch Image streaming failed to run a pod for containers.
Cleaning up Image streaming configs...
Task task/pr-1205-67d9f0ff-2-4afc09ac-aa0e-43420-group0-0/0/0 runnable 0 wait error: exit status 202
```
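For reference, I'm pulling that agent output from Cloud Logging with something along these lines. The batch_agent_logs log name and the job_uid label are what I see in my project, so treat them as assumptions and adjust for yours; the project ID and UID below are placeholders.

```python
from google.cloud import logging

PROJECT_ID = "my-project"        # placeholder
JOB_UID = "FAILING_JOB_UID"      # placeholder: UID of the failing job

client = logging.Client(project=PROJECT_ID)
log_filter = (
    f'logName="projects/{PROJECT_ID}/logs/batch_agent_logs" '
    f'AND labels.job_uid="{JOB_UID}"'
)

# Print the batch agent output for the failing job, newest entries first.
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.payload)
```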

Other (identical) jobs succeed, so I don't think a permissions issue explains this. Has anybody seen this before, or have an idea of what to troubleshoot?

 


Do you have an estimate of how many VMs are created by your job and how many jobs you start around the same time? I wonder whether the pull requests hit some quota limit.

Has anyone else experienced this issue? It doesn't seem to be related to quota.

I've experienced this error as well, specifically when cold-starting a single batch job (1 vCPU) after no batch jobs had used this image in ~30 minutes.

Within 5 minutes afterward, I submit another batch job with the same container image URL and the error does not occur.

```
E0917 02:15:51.680145 1107 remote_runtime.go:222] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
```

 

Could you share a batch job UID which failed recently due to this issue?
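If it is easier, the UID can be read off the job resource itself; a quick sketch with the Python client, where the project, region, and job name are placeholders:

```python
from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()
job = client.get_job(
    name="projects/MY_PROJECT/locations/us-central1/jobs/MY_JOB_NAME"  # placeholders
)
print(job.uid)  # the job UID to share
```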

Apologies for the extremely delayed response ... I could not get to this when I first reported the issue, and then it fell off my radar completely.

I just performed the experiment again to see if anything had changed, and the problem still exists for me ...

> Do you have an estimate of how many VMs are created by your job and how many jobs you start around the same time? I wonder whether the pull requests hit some quota limit.

I'm testing with a workload that creates 16 batch jobs, each of which runs 1 task per VM (so 16 VMs that are all started as soon as they can be). With that workload I see at least some failures for this reason 100% of the time with container streaming enabled ... (I haven't tested with a smaller number of VMs to see whether that makes the problem disappear ...)
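Roughly, the test workload just loops over the create_job() helper from the sketch in my first post; the project ID and naming scheme here are illustrative, not my real ones.

```python
import uuid

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"

# 16 independent jobs, each running a single task on its own VM,
# all submitted back to back with enable_image_streaming=True.
for i in range(16):
    job_id = f"streaming-test-{uuid.uuid4().hex[:8]}-{i}"
    create_job(PROJECT_ID, REGION, job_id, enable_streaming=True)
```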

Some failing batch job UIDs:
pr-1695-7ff98fe0-8-48b4682e-e8f4-435a0

pr-1695-44ae8ee8-0-9d2276e1-25de-41410

pr-1695-eb1df446-a-ee7a8435-b663-4d710

pr-1695-b847d7c-a1-e6b0d3e2-d6b5-42b30

pr-1695-ac40f2ec-2-e31dbbb8-1db9-40880

pr-1695-42377eba-1-b71f77f9-0afc-4afe0

 

Let me know if I can provide more info to help investigate this, and I will try to reply in a more timely fashion in the future!

Thank you for your reply. Do you mind sharing a recent job UID (within ~a week) if one is available?

I will launch an experiment now and create a new one showing the problem.  Will comment again in a few minutes.  Thanks!

I launched 16 batch jobs, all with the same OS, Docker image, and other launch parameters (aside from the application-specific parameters required by the application code).

13 failed (for this reason) and 3 completed.
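For what it's worth, I tallied the outcomes by listing the jobs and checking their states, roughly like this; the project ID is a placeholder and the UID prefix match is just illustrative.

```python
from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()
parent = "projects/MY_PROJECT/locations/us-central1"  # placeholder project

# Count how many jobs from this batch failed vs. succeeded.
for job in client.list_jobs(parent=parent):
    if job.uid.startswith("pr-1773-"):
        print(job.uid, job.status.state.name)
```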

Here are the job UIDs for 3 of the 13 that failed:

pr-1773-c2d6be8b-6-ce596ec8-9d29-410d0

pr-1773-baf30103-5-3927f3a4-7e66-44bd0

pr-1773-a0ee2b24-8-2f2c43cf-0bdd-43340

And the job UIDs for the 3 that didn't fail, in case that supplies any additional clues:

pr-1773-b06b7069-e-b79e4a26-aedf-46490

pr-1773-4a7d88da-a-43db6f15-16b6-49610

pr-1773-b699439e-5-735cd887-7bb5-44970

All these jobs were launched in the same region (us-central1)