When using GCP Batch, I'm getting lots of "Unexpected EOF" errors when pulling Docker containers. I'm wondering if this is because my containers are 16-17GB and the machines don't have enough disk space?
Hi @vedantroy-genmo,
Would you mind providing one failed example job with project id, job uid, and region so that we can help take a look?
Thanks!
Wenyan
Project ID: diffusion-trc
Job UID: tb-process-total-1-10b81a15-8248-42c80
Hi @vedantroy-genmo,
It looks like every task in your example job failed with exit code 1. Could you try running a job with only one task? That would help confirm whether the Docker image pull is hitting the unexpected EOF error, so that the image download fails and the task then fails because the image does not exist.
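For illustration, a single-task job request could be sketched as below. This is only a minimal sketch: the image URI is a hypothetical placeholder rather than the image from the failing job, and setting `maxRetryCount` to 0 and sending logs to Cloud Logging are assumptions made here so that the first pull failure is easy to find.

```python
import json


def single_task_job(image_uri: str) -> dict:
    """Build a Batch job request body that runs one container task."""
    return {
        "taskGroups": [
            {
                # Only one task, so a failed image pull is easy to isolate.
                "taskCount": 1,
                "taskSpec": {
                    "runnables": [{"container": {"imageUri": image_uri}}],
                    # Assumption: no retries, so the first failure is the one logged.
                    "maxRetryCount": 0,
                },
            }
        ],
        # Assumption: send task logs to Cloud Logging for inspection.
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


if __name__ == "__main__":
    # Hypothetical image URI; replace with the real 16-17GB image.
    job = single_task_job("us-docker.pkg.dev/my-project/my-repo/my-image:latest")
    print(json.dumps(job, indent=2))
```

The printed JSON can be saved to a file and submitted as the job config, with the same allocation settings as the failing job added if needed.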
In the meantime, Batch's default image uses a default boot disk size of 30GB, and your job is using a VM with 16 vCPUs and 65GB of memory, which should be enough. You can monitor your VM's disk usage with `df -h`, set the `installOpsAgent` field to true in your job request (https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#instancepolicyortempla...), or check whether there are any `no space left` messages in your VM logs. There are also a number of other things that can cause an unexpected EOF, e.g. a poor network connection.
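As a rough companion to `df -h`, the sketch below checks from Python whether a disk has room for a large image. The function names are made up for this example, and the 2x headroom factor is an assumption (space for the downloaded layers plus the extracted image), not a documented Batch or Docker requirement.

```python
import shutil

GIB = 1024 ** 3


def enough_space(free_bytes: int, image_gb: float, factor: float = 2.0) -> bool:
    """Return True if free_bytes covers image_gb times the headroom factor."""
    return free_bytes >= image_gb * factor * GIB


def report(path: str = "/", image_gb: float = 17.0) -> bool:
    """Print disk usage for path and return whether the image would likely fit."""
    usage = shutil.disk_usage(path)
    print(f"total={usage.total / GIB:.1f}GiB "
          f"used={usage.used / GIB:.1f}GiB "
          f"free={usage.free / GIB:.1f}GiB")
    return enough_space(usage.free, image_gb)


if __name__ == "__main__":
    # Assumption: the image is stored on the filesystem that holds "/".
    print("likely enough space:", report("/"))
```

Under that assumed factor, a 17GB image would want about 34GB free, so it is worth checking the actual numbers on the VM rather than relying on the estimate.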
Thanks!
Wenyan