Batch - Scalability

Hi. Suppose I’m running jobs with ~1000 parallel tasks each. Roughly how many jobs can I run simultaneously in a project without overwhelming the system?



I did a small test. I ran a single job with multiple tasks, where each task just echoes "hello world".

On 10 tasks it took 1-2 minutes.

On 1000 tasks it took much longer (maybe 10 minutes; I stopped watching after 5 minutes).

I’m used to using HTCondor on GCP at work, and I have used slurm-gcp at home in the past. Batch is significantly less scalable than either, and that is a death knell for my usage right now.

My use case on Google Cloud involves as many as hundreds of jobs running simultaneously with 1000 tasks each. Even on one job with 1000 tasks I’m seeing allocation time issues. On HTCondor, scaling through a MIG, I can scale to 100 jobs with 1000 tasks each in a few minutes. For one job with 1000 tasks it’s 2-3 minutes.

I would expect 1000 tasks running in parallel to take around 2 minutes as well. Do you see any quota issues? If Batch cannot allocate enough VMs to run the 1000 tasks, the job can be very slow. You can describe the job and check whether the job status reports any such issues with "gcloud batch jobs describe <job-id> --location <region>".
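If it helps, here is a minimal Python sketch of that check. It assumes the `gcloud` CLI is installed and authenticated, and that the job resource returned with `--format=json` carries its events under `status.statusEvents`, each with a `description` field. The helper names `describe_job` and `quota_events` are made up for illustration, and matching on the word "quota" is only a guess at how such events are worded, so treat a hit as a hint rather than proof.

```python
import json
import subprocess


def describe_job(job_id: str, region: str) -> dict:
    """Return the job resource as a dict via `gcloud batch jobs describe`."""
    result = subprocess.run(
        ["gcloud", "batch", "jobs", "describe", job_id,
         "--location", region, "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def quota_events(job: dict) -> list[str]:
    """Collect status event descriptions that mention quota.

    Assumption: quota problems show up as text in the event description;
    the exact wording is not guaranteed, so this is a keyword match only.
    """
    events = job.get("status", {}).get("statusEvents", [])
    return [
        event.get("description", "")
        for event in events
        if "quota" in event.get("description", "").lower()
    ]
```

Calling `quota_events(describe_job("<job-id>", "<region>"))` on a slow job returns the matching event descriptions, or an empty list if none mention quota.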

Hi. I found the issue. I was hitting a quota! Very sorry. Out of curiosity, would Batch handle 100 jobs with 500 tasks each well, or would it be overwhelmed?

Thanks!

Batch should be able to handle that. If you see any issue, please let us know.
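Since the slowdown earlier in this thread turned out to be a quota limit, a rough back-of-the-envelope check may be useful before launching a workload of this size. The sketch below only multiplies the numbers out. The function names, the 1 vCPU per task figure, and the 2,400 vCPU quota in the usage note are hypothetical, and it ignores VM shape packing and other quotas (disks, IP addresses), so treat the result as a rough bound only.

```python
import math


def peak_vcpus(jobs: int, tasks_per_job: int, cpu_milli_per_task: int) -> int:
    """vCPUs needed if every task of every job runs at the same time.

    cpu_milli_per_task is in thousandths of a vCPU (1000 = 1 vCPU).
    """
    return math.ceil(jobs * tasks_per_job * cpu_milli_per_task / 1000)


def max_parallel_tasks(quota_vcpus: int, cpu_milli_per_task: int) -> int:
    """Upper bound on tasks that fit within a vCPU quota at one time."""
    return quota_vcpus * 1000 // cpu_milli_per_task
```

For example, `peak_vcpus(100, 500, 1000)` gives 50000 vCPUs for 100 jobs of 500 one-vCPU tasks, while `max_parallel_tasks(2400, 1000)` gives 2400 tasks under a hypothetical 2,400 vCPU quota. In that situation most tasks would wait in the queue no matter how well the scheduler scales.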

@JimmyPinks Note that neither the GCP Batch job list console nor `gcloud batch jobs` can handle thousands of existing jobs.

The latency is VERY high with either method, and if you try to filter or sort the GCP Batch job list table, you will likely get the error "Sorry, the server was not able to fulfill your request."

We have recently made changes to address the latency on the Job List page. The improvements target the scenario where the Job List page takes several minutes to load or reports "Sorry, the server was not able to fulfill your request."

Please let us know if you are still experiencing latency when loading the Job List page. Thank you.