Re: Profiling batch job with multiple instances

shivamehta · 07-16-2024 03:32 AM

I have a quick question: I'm running a batch job with a parallelism setting of 8, processing 100 tasks. I want to perform GPU profiling with our PyTorch code. While we can do this in a notebook with a single instance, the batch job runs multiple instances in parallel. Do you have any suggestions on how to approach GPU profiling in this setup?

bolianyin

What kind of GPU profiling do you plan to do? If it is covered by Ops Agent, Batch works well with that through the installOpsAgent flag. Otherwise, you can probably run your own profiling code in a user task on each VM.
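For the "run your own profiling code" route, here is a minimal sketch of how each task could wrap its work in `torch.profiler` and write its own trace file. It assumes the job runs on Google Cloud Batch, which sets `BATCH_TASK_INDEX` for every task. The helper names, the one-in-ten sampling rule, and the output directory (for example a mounted Cloud Storage bucket) are hypothetical and only for illustration.

```python
import os
from pathlib import Path


def task_index() -> int:
    # BATCH_TASK_INDEX is set by Batch for each task; fall back to 0 elsewhere.
    return int(os.environ.get("BATCH_TASK_INDEX", "0"))


def should_profile(index: int, every: int = 10) -> bool:
    # Hypothetical sampling rule: profile one task in `every` to limit overhead.
    return index % every == 0


def trace_path(index: int, out_dir: str) -> Path:
    # One trace file per task so parallel tasks never overwrite each other.
    return Path(out_dir) / f"trace_task_{index:04d}.json"


def run_task(work, out_dir: str) -> None:
    """Run `work()`, profiling it only if this task is selected."""
    idx = task_index()
    if not should_profile(idx):
        work()
        return

    # Imported here so the helpers above can be used without PyTorch installed.
    import torch
    from torch.profiler import ProfilerActivity, profile

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    path = trace_path(idx, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    with profile(activities=activities, record_shapes=True) as prof:
        work()
    # The exported file can be opened in a Chrome-trace viewer such as Perfetto.
    prof.export_chrome_trace(str(path))
```

A task would then call something like `run_task(train_step, "/mnt/disks/profiles")`, where both the function and the path are placeholders. Since the traces are written to storage the job can already reach, no SSH access to the VMs should be needed to collect them.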

shivamehta

We do not have SSH access. Are there any alternatives to Ops Agent?

alexmoore

Can you share more details on how you are executing the batch process? For example, if you are using Google Cloud Batch, you can pass a flag to install the agent automatically: https://cloud.google.com/batch/docs/create-run-job-ops-agent#create-job-auto-install-op-agent
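As a rough sketch of what that flag looks like in a job definition, the snippet below builds a Batch job config with `installOpsAgent` enabled, using the task count and parallelism from the original question. The field layout follows the linked documentation as I understand it, so please check it against that page. The `build_job_config` helper and the `python train.py` command are hypothetical, and GPU-specific settings (accelerators, driver installation) are left out.

```python
import json


def build_job_config(task_count: int, parallelism: int, command: str) -> dict:
    """Build a Batch job config dict that asks Batch to install the Ops Agent."""
    return {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [{"script": {"text": command}}],
                },
                "taskCount": task_count,
                "parallelism": parallelism,
            }
        ],
        "allocationPolicy": {
            # The agent is installed by Batch when the VM is created,
            # so no SSH access to the VM is required.
            "instances": [{"installOpsAgent": True}],
        },
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


if __name__ == "__main__":
    # Placeholder command; replace with the real task entry point.
    config = build_job_config(100, 8, "python train.py")
    print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and submitted with `gcloud batch jobs submit JOB_NAME --location LOCATION --config job.json`, where the job name, location, and file name are placeholders.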