
How to avoid startup overhead for Cloud Batch jobs?

I am using Nextflow with Cloud Batch, which is supposed to be an official integration. With the current design of Cloud Batch, each compute task in Nextflow is submitted as an independent job, which means each task incurs the VM scheduling and startup overhead. This can slow down a pipeline significantly if there are many small steps, each taking only a few seconds.
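For context, this is roughly the setup in question: a minimal Nextflow configuration for the Google Batch executor, where every process task becomes its own Batch job. The project ID and region below are placeholders, not values from this thread.

```groovy
// nextflow.config -- sketch of the Google Batch executor setup
process {
    executor = 'google-batch'   // each task is submitted as a separate Batch job
}

google {
    project  = 'my-project-id'  // placeholder GCP project
    location = 'us-central1'    // placeholder region
}
```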

Ideally, such pipelines would benefit from persistent VMs, similar to what Vertex AI introduced recently. AWS Batch also has this notion through Job Queues, where jobs submitted to the same queue can share VMs, and AWS scales these instances up or down based on the throughput of incoming job requests.

Are there plans for Cloud Batch to remove the startup overhead for individual jobs in a long queue? Moving from AWS, this is a clear pain point.

I suppose serious users can also consider Cloud HPC Toolkit instead of Cloud Batch.


Google Batch has a mechanism in private preview that supports reusing VMs across jobs. However, this requires additional integration work in Nextflow to take advantage of it. HPC Toolkit can use Cloud Batch as a backend and can already use the new feature.

Hi folks, isn't VM reuse activated by the use of semantic labels? If that's the case, Nextflow already supports it via the `process.resourceLabels` directive.
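For reference, `resourceLabels` attaches key-value labels to the cloud resources Nextflow provisions for each task. A minimal sketch (the label keys and values here are illustrative, and whether Batch uses them to group jobs onto shared VMs is exactly the open question above):

```groovy
// nextflow.config -- attach labels to the resources created for each task
process {
    resourceLabels = [
        pipeline: 'my-pipeline',  // illustrative label key/value
        team:     'genomics'      // illustrative label key/value
    ]
}
```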

VM reuse is in private preview now. We plan to provide a more accessible way in the coming months.

Thank you @pditommaso and @bolianyin.

@bolianyin We would be interested in trying out this feature if it can be made available to us.
@pditommaso Happy to contribute, if needed, towards an integration effort.