I was not sure where to place this question. Please point me to another one if this is not appropriate here.
I'm trying to run some ML experiments using multiple GPUs through Vertex AI, and I need a reproducible environment. Specifically, I'd like the network connection between GPUs to be reproducible and as fast as possible.
I couldn't find any placement guarantees for Vertex AI, except that it runs in a single zone. Is there a way to constrain placement further for a more reproducible environment? For example, one that ensures all the GPUs are on the same rack.
When searching around, I learned that when we instantiate normal VMs (outside of Vertex AI), we can specify a compact placement policy and that this policy guarantees that the VMs are placed in "the same network infrastructure". What's the granularity of this placement? Is it at the server rack level?
I'm new to this cloud environment, but I hope I got the terminology correct. Please let me know if more clarification is required.
Thanks in advance.
The granularity of VM instance placement policies is the server rack level; racks are in turn grouped into clusters within a data center. This includes the compact placement policy. You may refer to the VM instance placement policies documentation for more details.
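For plain Compute Engine VMs (outside Vertex AI), a compact placement policy is created as a "group placement" resource policy and then attached at instance creation. A minimal sketch, where the policy name, instance names, region/zone, and machine type are all placeholder assumptions you would replace with your own:

```shell
# Create a compact ("collocated") placement policy in a region.
# gpu-compact-policy and us-central1 are example values.
gcloud compute resource-policies create group-placement gpu-compact-policy \
    --collocation=collocated \
    --region=us-central1

# Create GPU VMs that reference the policy; all instances must live in
# the same region as the policy. a2-highgpu-1g is an example GPU
# machine type; GPU VMs also require --maintenance-policy=TERMINATE.
gcloud compute instances create gpu-node-1 gpu-node-2 \
    --zone=us-central1-a \
    --machine-type=a2-highgpu-1g \
    --resource-policies=gpu-compact-policy \
    --maintenance-policy=TERMINATE
```

Note that collocated placement constrains how many instances can share a policy and which machine types are supported, so check the current limits in the documentation before relying on it for larger experiments.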