GCP Batch does not scale

I have ~20k existing completed/failed GCP Batch jobs. In the GCP console, the Batch job list table takes ~5 minutes to load, and it does not default to showing the most recent jobs first.

If I try to sort the job list table by "Date created" or filter it (e.g., to just failed jobs), I always get the error "Sorry, the server was not able to fulfill your request." after waiting many minutes.

If I try to use `gcloud batch jobs` instead, such as the following to simply list all jobs:

```
gcloud batch jobs list --location=us-west1 --format="table[no-heading](jobId)" | wc -l
```

the command takes ~30 minutes.

So, GCP Batch does not scale, which makes viewing job logs and troubleshooting failed jobs VERY difficult.

The management interface, whether the GCP Console or `gcloud`, definitely cannot handle more than ~1,300 jobs. I have background cron jobs that run every six hours to prune completed (succeeded and failed) jobs older than four days.
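
For reference, the prune script is roughly the following (a minimal sketch using the `google-cloud-batch` Python client; the project, location, and four-day cutoff are placeholders):

```python
# Minimal sketch of a prune script run from cron every six hours.
# PROJECT_ID and LOCATION are placeholders.
from datetime import datetime, timedelta, timezone

from google.cloud import batch_v1

PROJECT_ID = "my-project"  # placeholder
LOCATION = "us-west1"      # placeholder
MAX_AGE = timedelta(days=4)


def prune_old_jobs() -> None:
    client = batch_v1.BatchServiceClient()
    parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    done_states = {
        batch_v1.JobStatus.State.SUCCEEDED,
        batch_v1.JobStatus.State.FAILED,
    }
    # Listing is the slow part, so filter client-side and delete as we go.
    for job in client.list_jobs(parent=parent):
        if job.status.state in done_states and job.create_time < cutoff:
            client.delete_job(name=job.name)  # returns a long-running operation


if __name__ == "__main__":
    prune_old_jobs()
```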

Even that doesn't ensure a responsive management experience. When we run 20k+ jobs in less than 24 hours, the experience is terrible no matter what.

Batch does execute our 20k+ jobs (each with 100-1,000 tasks) without any issues, though.

My team runs Nextflow pipelines that can generate tens of thousands of jobs per run. We need a solution that actually scales for viewing jobs, or at least the failed ones; otherwise, we cannot troubleshoot failed jobs.

Yep. That's why you have to set up Pub/Sub and then write a lot of boilerplate to track the state of every Job and Task. We do so to ensure that Spot-preempted tasks get retried.
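
For anyone going down the same road, the skeleton of that boilerplate looks roughly like this (a sketch only; the project and subscription IDs are placeholders, and the notification message attributes depend on your job's notification config, so inspect `message.attributes` rather than trusting hard-coded names):

```python
# Sketch of a Pub/Sub listener that tracks Batch job/task state changes.
# Assumes each job is submitted with a notification config publishing to a
# topic that SUBSCRIPTION_ID subscribes to.
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"         # placeholder
SUBSCRIPTION_ID = "batch-events"  # placeholder


def handle_message(message: pubsub_v1.subscriber.message.Message) -> None:
    # Attributes carry the job/task identity and the new state.
    attrs = dict(message.attributes)
    print(attrs)
    # Application logic goes here: persist the state, and if a task failed
    # because its Spot VM was preempted, resubmit or retry it.
    message.ack()


def main() -> None:
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)
    future = subscriber.subscribe(subscription_path, callback=handle_message)
    print(f"Listening on {subscription_path} ...")
    with subscriber:
        try:
            future.result()  # blocks until cancelled or an error occurs
        except KeyboardInterrupt:
            future.cancel()


if __name__ == "__main__":
    main()
```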

@nick-youngblut does filtering on jobs in gcloud work for you? The latency issue is being addressed, and the fix is on track to roll out soon. We'll update these threads once it does; you should see the improvement automatically within your projects.

Thanks @Shamel for the update! 

Job deletion via the Python SDK or the BigQuery approach (export-to-bigquery-delete-batch-jobs) seems like a hack that only partially addresses the problem.

For instance, one Nextflow pipeline run can generate tens of thousands of jobs in a matter of hours. How do I schedule the deletion cron job? Even deleting jobs older than one day doesn't work if >20k jobs were generated less than a day ago.

The best that I've come up with for now is:

```bash
gcloud batch jobs list --location=${LOCATION} --sort-by="~createTime" --filter='Status.State="FAILED"' --limit ${N}
```

`--filter='Status.State="FAILED"'` was hard to figure out, given the lack of documentation. 

Also, `gcloud batch jobs describe` only provides the job details, not the job logs. I've tried using `gcloud logging read` to get the job logs, but the command returns nothing:

```bash
gcloud logging read "resource.type=batch_job AND resource.labels.job_id='YOUR_JOB_ID'" --limit 50 --format "table(timestamp, textPayload)"
```
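
A possible alternative is querying Cloud Logging by the job's UID (the `uid` field from `gcloud batch jobs describe`). A minimal Python sketch, assuming the Batch log entries carry a `labels.job_uid` label:

```python
# Sketch: pull a Batch job's logs from Cloud Logging by job UID.
# JOB_UID is the `uid` field from `gcloud batch jobs describe`; the
# labels.job_uid filter is an assumption, so check your entries' labels.
from google.cloud import logging as cloud_logging

PROJECT_ID = "my-project"  # placeholder
JOB_UID = "your-job-uid"   # placeholder


def print_job_logs() -> None:
    client = cloud_logging.Client(project=PROJECT_ID)
    entries = client.list_entries(
        filter_=f'labels.job_uid="{JOB_UID}"',
        order_by=cloud_logging.DESCENDING,
        max_results=50,
    )
    for entry in entries:
        print(entry.timestamp, entry.payload)


if __name__ == "__main__":
    print_job_logs()
```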

Recent changes have been made to address the latency of the Job List page. The improvements target the scenario where the page takes several minutes to load or reports "Sorry, the server was not able to fulfill your request."

Please let us know if you are still experiencing latency when loading the Job List page. Thank you.