GCP Batch machine type and resource best practice

gradientopt · 05-07-2024 09:27 AM

Hi Team,

Let's say I have 1e4 tasks to run, and each task requires 1 CPU and 1 GB of memory. Should I specify a machine type with 2 CPUs and 2 GB of memory (option 1, each machine runs 2 tasks at a time), or one with 16 CPUs and 16 GB of memory (option 2, each machine runs 16 tasks at a time)? Which would be the fastest and cheapest way to get things done?

My prior is the following:

1. run_time(option_1) < run_time(option_2). The option 2 machine type is rarer, so it is harder to get allocated quickly.

2. cost(option_1) > cost(option_2). The memory and CPU costs of the two options should be roughly the same if the run time is roughly the same (depending on the answer to question #1), but option 2 reduces the number of VM instances, so the total boot disk cost, IP address cost, etc. would be much lower.
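The cost argument in point 2 can be sketched roughly as follows. This is only an illustration: the prices are hypothetical placeholders (not actual GCP pricing), `job_cost` is a made-up helper, and it assumes every task needs 1 CPU and 1 GB, takes the same time, and that all tasks run in a single wave.

```python
import math

def job_cost(task_count, tasks_per_vm, hours_per_task,
             cpu_price, mem_price, overhead_price):
    """Split a job's cost into compute and per-VM overhead.

    All prices are hypothetical hourly rates: cpu_price per CPU,
    mem_price per GB, overhead_price per VM (boot disk, IP address, etc.).
    """
    vm_count = math.ceil(task_count / tasks_per_vm)
    # Each VM holds tasks_per_vm CPUs and GBs, billed for the task duration.
    compute = vm_count * tasks_per_vm * hours_per_task * (cpu_price + mem_price)
    # Overhead is paid once per VM, regardless of how many tasks it hosts.
    overhead = vm_count * hours_per_task * overhead_price
    return compute, overhead

# Placeholder prices, chosen only to make the comparison concrete.
PRICES = dict(cpu_price=0.03, mem_price=0.004, overhead_price=0.005)

compute_1, overhead_1 = job_cost(10_000, 2, 1.0, **PRICES)   # option 1
compute_2, overhead_2 = job_cost(10_000, 16, 1.0, **PRICES)  # option 2
print(compute_1, overhead_1)
print(compute_2, overhead_2)
```

Under these assumptions the compute part is identical for both options, and only the per-VM overhead shrinks with the number of VMs.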

Could the team correct me if I am wrong? Thanks!

wenyhu

Hi @gradientopt,

I agree with most of your thinking on the trade-offs. If job running time is all that matters, I would recommend a smaller machine type with the most parallelism. A larger machine type would be slightly cheaper considering the total CPU/RAM you need. I would also recommend thinking about any other resources you need (if any), how large they are, and how often they are used, e.g. additional PD or SSD attached, GPU, etc. For example, if you also need to run GPU jobs, the GPU's availability and price might be a lot more important than the VM itself.

Thanks,

Wenyan

gradientopt

Thanks for the reply! I do not fully understand why "smaller machine type would lead to most parallelism". Shouldn't they have the same parallelism? Let's say I have 1e4 tasks that each require 1 CPU and 1 GB. If I use a 1-CPU, 1 GB machine, 1e4 machines will be allocated and 1e4 tasks run in parallel. If I use an 8-CPU, 8 GB machine, then 1250 machines get allocated and 1e4 tasks still run in parallel?
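The arithmetic above can be checked with a small sketch. `vms_needed` is a hypothetical helper, and it assumes tasks are packed onto a VM purely by CPU and memory, with no capacity reserved for anything else.

```python
import math

def vms_needed(task_count, cpus_per_task, mem_gb_per_task, vm_cpus, vm_mem_gb):
    """Return (tasks per VM, VM count) needed to run every task at once."""
    # A VM hosts as many tasks as its scarcer resource allows.
    tasks_per_vm = min(vm_cpus // cpus_per_task,
                       int(vm_mem_gb // mem_gb_per_task))
    if tasks_per_vm < 1:
        raise ValueError("the VM is too small to hold a single task")
    return tasks_per_vm, math.ceil(task_count / tasks_per_vm)

# 1e4 tasks at 1 CPU / 1 GB each, on three VM shapes.
for cpus in (1, 8, 16):
    per_vm, count = vms_needed(10_000, 1, 1, cpus, cpus)
    print(f"{cpus} CPU / {cpus} GB: {count} VMs x {per_vm} tasks = {count * per_vm} slots")
```

In each case the number of task slots is 1e4, so the achievable parallelism is the same; only the VM count changes.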

wenyhu

Hi @gradientopt, to clarify my reply: by `with most parallelism` I just meant that you don't set an extra parallelism limit in the job field. It does not mean the smaller VM type will have more task parallelism (although there will be more VMs). By default, the parallelism is the same in the two cases you mentioned; you are totally right. Sorry for the confusion!
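For reference, here is a rough sketch of where such a limit would sit in a job definition. The field names (`taskCount`, `parallelism`, `computeResource`, `machineType`) follow the Batch Job resource as I understand it, so please verify them against the current API reference. `batch_job_config` is a hypothetical helper, and the machine type is only an illustrative choice.

```python
import json

def batch_job_config(task_count, machine_type, parallelism=None):
    """Build a Batch job body as a plain dict (sketch, not a full job spec)."""
    task_group = {
        "taskCount": task_count,
        "taskSpec": {
            # 1 CPU and roughly 1 GB per task, as in the question.
            "computeResource": {"cpuMilli": 1000, "memoryMib": 1024},
            "runnables": [{"script": {"text": "echo task ${BATCH_TASK_INDEX}"}}],
        },
    }
    # Only add a parallelism cap when one is requested; leaving the field
    # out means no extra limit is set in the job itself.
    if parallelism is not None:
        task_group["parallelism"] = parallelism
    return {
        "taskGroups": [task_group],
        "allocationPolicy": {
            "instances": [{"policy": {"machineType": machine_type}}]
        },
    }

# Illustrative machine type; substitute whichever shape you are comparing.
job = batch_job_config(10_000, "e2-highcpu-16")
print(json.dumps(job, indent=2))
```

Passing `parallelism=100`, for example, would add an explicit cap to the task group, which is the kind of extra limit referred to above.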

gradientopt

Got it, no worries! To summarize, the only downside, if any, of specifying a larger machine for the above case is that such machines might be slower to allocate, causing a delay in reaching maximal parallelism?

wenyhu

Hi @gradientopt, in general yes, a larger machine would be slightly slower to allocate and spin up.