I have a quick question: I'm running a batch job with a parallelism setting of 8, processing 100 tasks. I want to perform GPU profiling with our PyTorch code. While we can do this in a notebook with a single instance, the batch job runs multiple instances in parallel. Do you have any suggestions on how to approach GPU profiling in this setup?
What kind of GPU profiling do you plan to do? If it is covered by the Ops Agent, Batch supports that through the installOpsAgent flag. Otherwise, you can probably run your own profiling code in a user task on each VM.
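To illustrate the "own profiling code" route, here is a minimal sketch, assuming the job runs on Google Cloud Batch (which sets BATCH_TASK_INDEX for each task) and that the tasks can write to some shared location such as a mounted bucket. The directory /mnt/disks/share/traces and the helper names trace_path and profile_step are hypothetical, not part of any library. Each task wraps its work in torch.profiler and writes a trace file named after its task index, so the 8 parallel tasks do not overwrite each other.

```python
import os
from pathlib import Path


def trace_path(base_dir="/mnt/disks/share/traces"):
    """Return a per-task trace file path so parallel tasks never collide.

    base_dir is a hypothetical mount point; use whatever storage your tasks can write to.
    """
    # Assumption: Batch sets BATCH_TASK_INDEX per task; fall back to "0" elsewhere (e.g. a notebook).
    task_index = os.environ.get("BATCH_TASK_INDEX", "0")
    return Path(base_dir) / f"task-{task_index}.json"


def profile_step(step_fn, base_dir="/mnt/disks/share/traces"):
    """Run step_fn under torch.profiler and export a Chrome trace for this task."""
    # Imported lazily so trace_path stays usable on machines without PyTorch.
    import torch
    from torch.profiler import ProfilerActivity, profile

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        # Only record CUDA kernels when a GPU is actually visible.
        activities.append(ProfilerActivity.CUDA)

    out = trace_path(base_dir)
    out.parent.mkdir(parents=True, exist_ok=True)

    with profile(activities=activities, record_shapes=True) as prof:
        step_fn()

    # The exported JSON can be opened later in a trace viewer, no SSH needed.
    prof.export_chrome_trace(str(out))
    return out
```

A task would call something like profile_step(lambda: train_one_batch(model, batch)), where train_one_batch stands in for your own code. Profiling adds overhead, so it may be worth wrapping only a few representative steps rather than the whole task.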
We do not have SSH access. Are there any alternatives to the Ops Agent?
Can you share more details on how you are running the batch job? For example, if you are using Google Cloud Batch, you can pass a flag to install the agent automatically: https://cloud.google.com/batch/docs/create-run-job-ops-agent#create-job-auto-install-op-agent
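For reference, a rough sketch of what such a job config might look like, generated from Python. It reuses the numbers from the question (100 tasks, parallelism 8); the script text python3 train.py is a placeholder, and the machine type and accelerator policy are left out, so treat this as an outline to check against the linked page rather than a complete config.

```python
import json


def build_job_config(task_count=100, parallelism=8, install_ops_agent=True):
    """Build a Batch job config dict with the Ops Agent flag set on the instances."""
    return {
        "taskGroups": [
            {
                "taskCount": task_count,      # total number of tasks in the job
                "parallelism": parallelism,   # how many tasks run at the same time
                "taskSpec": {
                    "runnables": [
                        # Placeholder command; replace with your own entry point.
                        {"script": {"text": "python3 train.py"}}
                    ]
                },
            }
        ],
        "allocationPolicy": {
            "instances": [
                # Assumption: machine type and GPU settings would be added here too.
                {"installOpsAgent": install_ops_agent}
            ]
        },
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


if __name__ == "__main__":
    # Print the config so it can be saved to a file such as job.json.
    print(json.dumps(build_job_config(), indent=2))
```

The printed JSON could then be saved as job.json and submitted with gcloud batch jobs submit JOB_NAME --location LOCATION --config job.json. Because the agent is installed when the VM is created, this does not require SSH access.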