
Google Batch Job immediately fails after Docker image is downloaded

Hello,

I have a batch job that ran just fine until about two weeks ago. Neither the job definition nor the Docker image has changed in functionality. `gcloud batch jobs describe` does not yield anything useful beyond what's already on the UI.


And from the cloud logs, the last entry before the status changes is: [Batch Action] Docker credential helper succeeded.

Here is the job definition:

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "europe-west1-docker.pkg.dev/protocol-labs-data/pl-data/filet:v0.7.0",
              "volumes": [
                "/mnt/disks/snapshots/historical:/snapshots"
              ],
              "entrypoint": "/lily/export.sh",
              "commands": [
                "/snapshots/snapshot_2718720_2721610_1679986506.car.zst",
                "/snapshots/out/"
              ],
              "options": "--privileged -e LILY_BLOCKSTORE_CACHE_SIZE=1400000 -e LILY_STATESTORE_CACHE_SIZE=1400000"
            }
          },
          {
            "script": {
              "text": "ls /mnt/disks/share"
            }
          }
        ],
        "computeResource": {
          "cpuMilli": 16000,
          "memoryMib": 131072,
          "boot_disk_mib": 500000
        },
        "volumes": [
          {
            "gcs": {
              "remotePath": "test_snapshot_store"
            },
            "mountPath": "/mnt/disks/snapshots"
          }
        ],
        "maxRetryCount": 1,
        "maxRunDuration": "259200s"
      },
      "taskCount": 1,
      "parallelism": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "machineType": "n2-highmem-16",
          "provisioningModel": "STANDARD"
        }
      }
    ]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}

I have the same exact problem. I think it's related to mounting the bucket: when I run scripts that do not require a GCP bucket, the job succeeds, but jobs that were working fine the last time I used them (also around two weeks ago) and that require storage volumes now fail. (I even tried to run the hello-world-bucket.json from the GCP tutorials and it failed.)
Anyway, I hope somebody replies soon! (Posting to follow the thread.)

Not sure if this is your issue, but jobs silently fail if any of the gcsfuse operations mounting the bucket fail. This means things will break in the `batch_agent` runnable responsible for the mounting. Some cases that caused issues for me were a misspelled bucket name, and the batch job service account's access permissions to the bucket changing (I think full Storage Admin is required).

I didn't change anything, but now it is working again. Weird! I had been trying since last Friday with no success.

We had an issue when the GCS bucket has fine-grained access control, which should be fixed now. It might be related.

@bolianyin yes, the bucket has fine-grained access control. The job is still failing. Are you saying that I need to switch it to uniform access? If so, that wouldn't really be an option for us, as only a sub-path is open to the public for reads, with a specific service account granted create permissions on it.

@jacksonwb thanks for the tip. I am actually using my principal (Owner permissions) to trigger the jobs, so I'm not sure that's the issue. I've also looked at the batch agent logs, and this is probably the only useful thing:

task action/STARTUP/0/0/group0 runnable 5 execution err: command failed with exitCode 1

I just tried running the equivalent gcsfuse command on a Compute Engine instance and it worked without issues. Google Batch jobs are still failing.

edit:

after much trial and error, I've figured out that the issue occurs when mounting a GCS bucket with requester pays enabled, like:

{
  "gcs": {
    "remotePath": "fil-mainnet-archival-snapshots"
  },
  "mountPath": "/mnt/disks/snapshots",
  "mountOptions": [
    "--billing-project protocol-labs-data",
    "--only-dir historical-exports"
  ]
}

@kasteph: The issue was only with fine-grained access control and it should work now. Do you see other errors, or does everything work now with the requester pays options?

Unfortunately, I'm still seeing issues. There are no meaningful errors in the batch_agent logs because they are directed to /dev/null, which @jacksonwb has pointed out in a separate thread before. Basically, I'm still only getting this log:

task action/STARTUP/0/0/group0 runnable 5 execution err: command failed with exitCode 1

FWIW, I have moved objects around and started using uniform access level on the bucket, and it's still the same issue. I suspect that the billing project flag is not being respected by gcsfuse.

This looks unrelated to the fine-grained access issue.

Does the service account of the Batch job (the Compute Engine default service account, by default) have permission on the billing project? You can try running the same gcsfuse command to mount the bucket, either in a script runnable in the same task or by SSHing into the VM, and see if you get meaningful errors.
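For example, a diagnostic script runnable along these lines could be placed before the container runnable to surface the gcsfuse error (a sketch only: the bucket name, billing project, and mount path are taken from earlier in this thread, and the exact flags are an assumption):

```json
{
  "script": {
    "text": "mkdir -p /mnt/disks/snapshots && gcsfuse --billing-project protocol-labs-data fil-mainnet-archival-snapshots /mnt/disks/snapshots 2>&1"
  }
}
```

Since the script's stdout/stderr go to Cloud Logging, any mount failure message should then show up in the job logs instead of being lost.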

At the same time, we should look into exposing more relevant error messages without spamming the normal cases too much.

What permission is needed for the default service account on the billing project? The Compute Engine default service account already has Service Usage Consumer on the project, and it is a Storage Admin on the billing project itself. I'd imagine that would be sufficient.

@bolianyin I tried your suggestion and am running gcsfuse as a script.

The default google compute service account has the following permissions:

"resourcemanager.projects.createBillingAssignment",
"serviceusage.services.use"

I am mounting the bucket with gcsfuse like so:

gcsfuse --implicit-dirs --o allow_other --billing-project protocol-labs-data fil-mainnet-archival-snapshots /mnt/disks/snapshots

Here's the error I'm getting (which doesn't make sense, considering the service account has the permissions listed above):

daemonize.Run: readFromProcess: sub-process: mountWithArgs: mountWithConn: fs.NewServer: create file system: SetUpBucket: Error in iterating through objects: googleapi: Error 400: Bucket is a requester pays bucket but no user project provided., required

@kasteph Thanks for trying that and getting back to us.

This looks like a gcsfuse-related issue. Is the bucket owned by the same project as the VM? I am not sure how you can provide a user project to gcsfuse; it might default to the VM's project. I will try to find someone more familiar with GCS/gcsfuse to help.

@kasteph The issue is likely due to gcsfuse's go client library update and it is now tracked here: https://github.com/GoogleCloudPlatform/gcsfuse/pull/1050

Using "--enable-storage-client-library=false" as an additional option could be a workaround before they fix that bug. Do you mind trying that?
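Applied to the volume definition posted earlier in this thread, the workaround would look something like this (a sketch, assuming the flag is passed through `mountOptions` the same way as the other gcsfuse flags):

```json
{
  "gcs": {
    "remotePath": "fil-mainnet-archival-snapshots"
  },
  "mountPath": "/mnt/disks/snapshots",
  "mountOptions": [
    "--billing-project protocol-labs-data",
    "--only-dir historical-exports",
    "--enable-storage-client-library=false"
  ]
}
```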