Hi all,
I ran a job with 200+ tasks using a docker container runnable whose image is hosted in GCP's Artifact Registry. The job completed with all tasks succeeding except for 4 failed tasks. Looking at the batch_task_logs, I see the following exception listed 4 times (once for each of the failed tasks):
docker: Error response from daemon: Head "https://us-east4-docker.pkg.dev/v2/XXXX/XXXX/XXXX/manifests/latest": denied: Permission "artifactregistry.repositories.downloadArtifacts" denied on resource "projects/XXXX/locations/us-east4/repositories/XXXX" (or it may not exist)
The same docker image and service account are used for all tasks, and after re-running the job, all tasks succeeded with no issues.
This makes me think we're running into some concurrency or rate-limiting issue when accessing Artifact Registry. Is there some sort of quota increase I need to request for Artifact Registry? Or could retries be configured on GCP's backend to attempt the image pull multiple times if it fails? Our jobs are time-critical and run as part of our production process, so I'm hesitant to enable retries more broadly: if a task hits an application failure (i.e., a bad code push), it would retry multiple times with no chance of succeeding and delay our production pipelines.
Thanks in advance for the help!
Thanks for using Batch!
Google Artifact Registry has a default limit of 60k QPM per project per region. So with 200+ tasks, each pulling ~10 packages within a minute, it is quite likely to be a quota issue. If so, can you try requesting a quota increase following https://cloud.google.com/artifact-registry/quotas#request_a_quota_increase?
In the meantime, if you could provide your job UID to us, it would be easier for Batch to triage whether the tasks for your job were likely to hit the GAR quota limit.
Thanks!
Hi Wenyhu,
Thanks for the reply!
Can you clarify what you mean by "package" when you say "10-ish packages each within a minute"? The way the job is configured, all tasks use the same single container. Does each layer of the container's docker image count as an individual pull from the registry?
Additionally, if Batch is pulling ~10 packages × 200 tasks, wouldn't that still only be 2,000 pulls from Artifact Registry, which is well below the 60k limit?
In the meantime, I'll go ahead and request a quota increase on Artifact Registry pulls to see if that resolves the issue. Lastly, here's the job UID and region in case it's helpful for debugging/diagnosing:
Job UID: prodserver-constra-e6d727ed-6dab-4d090
Region: us-east4
Thank you again for all the help!
Hi @jfishbein,
Sorry for the confusion about the quota limit. With the job information you provided, we found that the 4 tasks failed with a 403 error because Google Artifact Registry did not receive authentication when the `docker pull` request was made.
Since it happened randomly, one suspicion is that there was an unstable network issue with the metadata server at that time. For more information about authentication, you can refer to https://cloud.google.com/compute/docs/access/authenticate-workloads
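For context, here is a minimal sketch of the credential path involved, written with Python's `requests` purely for illustration (the endpoint and header are the standard Compute Engine metadata-server convention, not Batch's actual implementation). A transient failure at this step would surface exactly as an unauthenticated pull, i.e. the 403 above:

```python
import requests

# On a GCE VM, the Docker credential helper fetches an OAuth2 access token
# for the attached service account from the metadata server. If this request
# fails or times out, the registry request goes out without credentials and
# Artifact Registry answers 403 "denied ... (or it may not exist)".
METADATA_TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)

resp = requests.get(
    METADATA_TOKEN_URL,
    headers={"Metadata-Flavor": "Google"},  # required header for the metadata server
    timeout=5,
)
resp.raise_for_status()
token = resp.json()["access_token"]

# The token is then presented as a bearer credential when pulling from
# us-east4-docker.pkg.dev (e.g. `docker login -u oauth2accesstoken`).
```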
Batch should already retry docker image pulls on our side; if not, please let us know!
At the same time, if the image download fails, the task running the docker command should fail with a specific exit code. You can use TaskSpec's LifecyclePolicy together with TaskSpec's maxRetryCount to retry only tasks that fail with those exit codes: https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#taskspec. A sketch of this setup follows below.
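As an illustration, here is a minimal sketch of that configuration using the `google-cloud-batch` Python client. The image URI, project/region paths, and the exit code are hypothetical placeholders; in particular, take the actual pull-failure exit code from your batch_task_logs rather than from this example:

```python
from google.cloud import batch_v1

# Hypothetical values for illustration only: replace with your image and the
# exit code your failed tasks actually report in batch_task_logs.
IMAGE_URI = "us-east4-docker.pkg.dev/PROJECT/REPO/IMAGE:latest"
PULL_FAILURE_EXIT_CODE = 1

runnable = batch_v1.Runnable(
    container=batch_v1.Runnable.Container(image_uri=IMAGE_URI)
)

# Retry only tasks whose exit code matches the pull-failure condition, so an
# application failure (e.g. a bad code push) does not trigger retries.
retry_on_pull_failure = batch_v1.LifecyclePolicy(
    action=batch_v1.LifecyclePolicy.Action.RETRY_TASK,
    action_condition=batch_v1.LifecyclePolicy.ActionCondition(
        exit_codes=[PULL_FAILURE_EXIT_CODE]
    ),
)

task_spec = batch_v1.TaskSpec(
    runnables=[runnable],
    max_retry_count=2,  # upper bound on automatic retries per task
    lifecycle_policies=[retry_on_pull_failure],
)

job = batch_v1.Job(
    task_groups=[batch_v1.TaskGroup(task_spec=task_spec, task_count=200)]
)

client = batch_v1.BatchServiceClient()
client.create_job(
    parent="projects/PROJECT/locations/us-east4",
    job=job,
    job_id="my-retry-demo-job",
)
```

This way, a transient registry failure can be retried while application errors still fail fast and don't delay your pipeline.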
Hope the above helps! Thanks!