
Very strange behavior lately... logged status messages from Google Batch flagged as errors

Have a look at the picture below. There are several pages of log messages from Google Batch housekeeping (setting up the environment) showing up as errors in the log. This particular job has 24 parallel tasks, and each of the tasks has housekeeping log messages showing up as errors. The vast majority of messages, including all of the tasks' own log messages, are showing up correctly.

I've also been experiencing failed pulls from Debian repositories that cause a task to fail before it even reaches the work process, as well as instances where a task writes a small state file to GCS but the file never actually lands there, despite the write and close operations on the file succeeding.

Common messages are reported as errors

Robert


This looks like an info log written to stderr, which Batch treats as an error. Do you see a behavior change, or has it always been like this?

The log errors started on August 29th; they appear in every log since that day and in none of the logs before that day that I checked (~10).

I also resolved the job mysteriously terminating. It turns out that at startup, Google Batch was failing on a GCE instance insert with a ZONE_RESOURCE_POOL_EXHAUSTED error. I tried reducing the total number of CPUs for the job from 384 to 192 and still got pool exhausted. When I specified an older EPYC Milan processor, everything started working. We've been using the Genoa CPUs without problem since they were rolled out, so I'm not sure what's changed, but we have heavily optimized our code with AVX and get a nice boost in performance from Genoa processors. I think that if Batch cannot start a requested node, the job shouldn't run; pool exhausted should be a fatal error for a Batch job. As it is, Batch doesn't report anything to the job; it just continues to run, and our app runs for a while and then times out when trying to synchronize statistics between the nodes.
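For reference, requesting a specific CPU platform in the job JSON looks roughly like this (an illustrative sketch only; minCpuPlatform is the field in the Batch allocationPolicy instance policy, and the platform string here is just an example):

```json
{
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "minCpuPlatform": "AMD Milan"
        }
      }
    ]
  }
}
```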

Robert

Hi @rpurdom ,

Could you share the region and job UID (or an identifiable substring of the job UID if you prefer) so that we can look into the log error issue? We are not aware of any change that may cause this. We think those logs are written to stderr instead of stdout, which generates the "ERROR" severity.

Do you mean that when you specify min_cpu_platform as "AMD Genoa", the job runs even with the "ZONE_RESOURCE_POOL_EXHAUSTED" error? How many VMs does the job use? I agree that the job should not run if no VM is created, but if at least one VM is created, the job will still run.

region: us-central1 

job UID: sept-batch1-6bb1cdd0-50d2-420e-b514-70  (with errors)

job UID: december-internal-5b37e98c-fc96-4d28-0 (before aug 29)

We run many varieties of jobs, but the one that is problematic right now runs 24 parallel nodes, typically with 16 processors and 64 GB RAM each. These parameters can be varied depending on the workload; we can run 192 nodes with 30 processors, but we haven't needed that power yet (it's coming).

In the past, and for the last year or so, we specified the machine by referring to an instance template set up in the project and referenced in the batch JSON with "instanceTemplate": "genoa-16-64". But a couple of weeks ago that stopped working, and we started specifying the machine directly in the job JSON with "machineType": "c3d-standard-16". When I run a job with 24 parallel tasks and "c3d-standard-*" (* meaning 4, 8, 16, or 30), we get ZONE_RESOURCE_POOL_EXHAUSTED most of the time, and it doesn't matter if I specify fewer CPUs. When we set machineType to "t2d-standard-*", it works and we don't get pool exhausted. I don't have a log with this error (we delete batch jobs that failed), but if you need one, I can run a job to generate the ZONE_RESOURCE_POOL_EXHAUSTED.
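For clarity, the two shapes of allocation policy we've used look roughly like this (a sketch only; the template name and machine type are from our setup, structure per the Batch jobs reference). The instance-template form:

```json
{
  "allocationPolicy": {
    "instances": [
      { "instanceTemplate": "genoa-16-64" }
    ]
  }
}
```

versus the direct machine-type form:

```json
{
  "allocationPolicy": {
    "instances": [
      { "policy": { "machineType": "c3d-standard-16" } }
    ]
  }
}
```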

This job works by splitting the data between parallel nodes, each working on its own chunk of the data, and each node periodically shares its state with the other nodes so that every node can compute stats based on 100% of the data. If one node doesn't start, nothing can work, because all nodes rely on the value of "TASK_COUNT" and the tasks run in parallel; otherwise everything deadlocks. So if one task fails on insert, the only way the job finds out is that it times out at the first synchronization point and terminates. We've put a code change in the dev pipeline to add synchronization as the very first step in every job type.

So, long story short, if a VM doesn't start it's catastrophic for our job, and I would suspect virtually every job that specifies parallel > 1 will fail if Batch fails to start "parallel" VMs. "parallel" is a requirement for the job to run, not a suggestion.
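For reference, the task group portion of our job JSON looks roughly like this (a sketch only, runnables omitted; the counts match the 24-task case above and the resource values are illustrative):

```json
{
  "taskGroups": [
    {
      "taskCount": 24,
      "parallelism": 24,
      "taskSpec": {
        "computeResource": { "cpuMilli": 16000, "memoryMib": 65536 }
      }
    }
  ]
}
```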

Robert

@rpurdom:

The errors revealed an issue, and we should be able to fix it soon.

Regarding the c3d and t2d machine types, I can't think of reasons other than the particular machine-type, disk, and other resources being exhausted for c3d. If you can create a new job that reproduces the issue, we can look into it further.

Regarding parallel running tasks: the default behavior in Batch is that tasks will run as long as at least one VM is created. There is a ["barrier"](https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#barrier) runnable, which can be used as a way to synchronize tasks. Depending on your use case, it may or may not help. We have plans to implement "gang scheduling", but no specific date yet.
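As a rough sketch of where a barrier would sit in the task spec (structure per the jobs reference linked above; the script text and barrier name are placeholders):

```json
{
  "taskSpec": {
    "runnables": [
      { "script": { "text": "echo per-task setup" } },
      { "barrier": { "name": "all-tasks-ready" } },
      { "script": { "text": "echo main work" } }
    ]
  }
}
```

Each task waits at the barrier until all tasks in the group have reached it, then they all continue.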

I can verify that the errors are gone, and I'm also now able to use c3d instances again with no POOL_EXHAUSTED issues. Related, or just more free resources?

Robert

Thanks for confirming. They are separate issues. The pool exhausted errors are probably due to temporary capacity issues.