GCP Batch does not scale

I have ~20k existing completed/failed GCP Batch jobs. In the GCP console, the Batch job list table takes ~5 minutes to load, and it does not default to showing the most recent jobs first.

If I try to sort the job list table by "Date created" or filter it (e.g., to just failed jobs), I always get the error "Sorry, the server was not able to fulfill your request." after waiting many minutes.

If I try to use `gcloud batch jobs` instead, such as the following to simply list all jobs:

```
gcloud batch jobs list --location=us-west1 --format="table[no-heading](jobId)" | wc -l
```

the command takes ~30 minutes.

So, GCP Batch does not scale, which makes viewing job logs and troubleshooting failed jobs VERY difficult.

The management interface, whether the GCP Console or `gcloud`, definitely cannot handle more than ~1,300 jobs. I have background cron jobs that run every six hours to prune completed (succeeded and failed) jobs older than four days.
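
For reference, the prune script is roughly the following (a minimal sketch using the `google-cloud-batch` Python client; the project, location, and four-day cutoff are placeholders):

```python
# Minimal sketch of a prune script run from cron every six hours.
# PROJECT_ID and LOCATION are placeholders.
from datetime import datetime, timedelta, timezone

from google.cloud import batch_v1

PROJECT_ID = "my-project"  # placeholder
LOCATION = "us-west1"      # placeholder
MAX_AGE = timedelta(days=4)


def prune_old_jobs() -> None:
    client = batch_v1.BatchServiceClient()
    parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    done_states = {
        batch_v1.JobStatus.State.SUCCEEDED,
        batch_v1.JobStatus.State.FAILED,
    }
    # Listing is the slow part, so filter client-side and delete as we go.
    for job in client.list_jobs(parent=parent):
        if job.status.state in done_states and job.create_time < cutoff:
            client.delete_job(name=job.name)  # returns a long-running operation


if __name__ == "__main__":
    prune_old_jobs()
```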

Even that doesn't ensure a responsive management experience. When we run 20k+ jobs in less than 24 hours, the experience is terrible no matter what.

Batch does execute our 20k+ jobs (each with 100-1,000 tasks) without any issues, though.

My team runs Nextflow pipelines that can generate tens of thousands of jobs per run. We need a solution that actually scales for viewing jobs, or at least the failed ones; otherwise, we cannot troubleshoot failed jobs.

Yep. That's why you have to set up Pub/Sub and then write a lot of boilerplate to track the state of every Job and Task. We do so to ensure that Spot-preempted tasks get retried.
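
For anyone going down the same road, the skeleton of that boilerplate looks roughly like this (a sketch only; the project and subscription IDs are placeholders, and the notification message attributes depend on your job's notification config, so inspect `message.attributes` rather than trusting hard-coded names):

```python
# Sketch of a Pub/Sub listener that tracks Batch job/task state changes.
# Assumes each job is submitted with a notification config publishing to a
# topic that SUBSCRIPTION_ID subscribes to.
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"         # placeholder
SUBSCRIPTION_ID = "batch-events"  # placeholder


def handle_message(message: pubsub_v1.subscriber.message.Message) -> None:
    # Attributes carry the job/task identity and the new state.
    attrs = dict(message.attributes)
    print(attrs)
    # Application logic goes here: persist the state, and if a task failed
    # because its Spot VM was preempted, resubmit or retry it.
    message.ack()


def main() -> None:
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)
    future = subscriber.subscribe(subscription_path, callback=handle_message)
    print(f"Listening on {subscription_path} ...")
    with subscriber:
        try:
            future.result()  # blocks until cancelled or an error occurs
        except KeyboardInterrupt:
            future.cancel()


if __name__ == "__main__":
    main()
```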

@nick-youngblut does filtering on jobs in gcloud work for you? The latency issue is being addressed, and the fix is on track to roll out soon. We'll update these threads once it does; you should see the improvement automatically within your projects.

Thanks @Shamel for the update! 

Job deletion via the Python SDK or the BigQuery approach (export-to-bigquery-delete-batch-jobs) seems like a hack that only partially addresses the problem.

For instance, one Nextflow pipeline run can generate tens of thousands of jobs in a matter of hours. How do I schedule the deletion cron job? Even deleting jobs older than one day doesn't work if >20k jobs were generated less than a day ago.

The best that I've come up with for now is:

```bash
gcloud batch jobs list --location=${LOCATION} --sort-by="~createTime" --filter='Status.State="FAILED"' --limit ${N}
```

`--filter='Status.State="FAILED"'` was hard to figure out, given the lack of documentation. 

Also, `gcloud batch jobs describe` only provides the job details, not the job logs. I've tried using `gcloud logging read` to get the job logs, but the command returns nothing:

```bash
gcloud logging read "resource.type=batch_job AND resource.labels.job_id='YOUR_JOB_ID'" --limit 50 --format "table(timestamp, textPayload)"
```
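
A possible alternative is querying Cloud Logging by the job's UID (the `uid` field from `gcloud batch jobs describe`). A minimal Python sketch, assuming the Batch log entries carry a `labels.job_uid` label:

```python
# Sketch: pull a Batch job's logs from Cloud Logging by job UID.
# JOB_UID is the `uid` field from `gcloud batch jobs describe`; the
# labels.job_uid filter is an assumption, so check your entries' labels.
from google.cloud import logging as cloud_logging

PROJECT_ID = "my-project"  # placeholder
JOB_UID = "your-job-uid"   # placeholder


def print_job_logs() -> None:
    client = cloud_logging.Client(project=PROJECT_ID)
    entries = client.list_entries(
        filter_=f'labels.job_uid="{JOB_UID}"',
        order_by=cloud_logging.DESCENDING,
        max_results=50,
    )
    for entry in entries:
        print(entry.timestamp, entry.payload)


if __name__ == "__main__":
    print_job_logs()
```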

Recent changes have been made to address the latency of the Job List page. The improvements target the scenario where the page takes several minutes to load or reports "Sorry, the server was not able to fulfill your request."

Please let us know if you are still experiencing latency when loading the Job List page. Thank you.