
H100 (a3-highgpu) instances and Local SSD

Are there any known bugs with mounting Local SSDs on H100 (a3-highgpu family) instances? I create my Batch jobs with the Python SDK, and I typically create a batch_v1.AllocationPolicy.Disk() object configured with type_="local-ssd" and size_gb set to the total size of the Local SSD disks. Since the a3-highgpu family provisions Local SSD automatically, I set size_gb to whatever capacity is attached when the instance spins up.
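For reference, the disk piece of my allocation policy looks roughly like the sketch below; the machine type, total size, and device name are illustrative placeholders rather than my exact values.

    from google.cloud import batch_v1

    # Rough sketch of the disk piece of the allocation policy. The machine type,
    # total size, and device name are illustrative placeholders, not exact values.
    def build_allocation_policy() -> batch_v1.AllocationPolicy:
        disk = batch_v1.AllocationPolicy.Disk(
            type_="local-ssd",
            size_gb=6000,  # assuming a3-highgpu-8g's automatically provisioned 16 x 375 GB
        )
        attached = batch_v1.AllocationPolicy.AttachedDisk(
            new_disk=disk,
            device_name="local-ssd-0",  # hypothetical device name
        )
        instance_policy = batch_v1.AllocationPolicy.InstancePolicy(
            machine_type="a3-highgpu-8g",
            disks=[attached],
        )
        return batch_v1.AllocationPolicy(
            instances=[
                batch_v1.AllocationPolicy.InstancePolicyOrTemplate(policy=instance_policy)
            ],
        )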

My problem is that on a3-highgpu specifically, the disks are attached as expected, but they are not mounted at my chosen mount point. With a similar machine family such as a2-ultragpu (which also attaches Local SSD disks automatically), simply creating the AllocationPolicy.Disk() object above is enough for the instances to automatically RAID the Local SSD drives and mount them at the expected mount point.
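For completeness, the mount point itself comes from the task spec, where I reference the disk by device name; roughly like this (device name and mount path are again illustrative):

    from google.cloud import batch_v1

    # Sketch of the mount side, assuming the device_name from the allocation
    # policy above; the mount path is an illustrative placeholder.
    def build_task_spec() -> batch_v1.TaskSpec:
        volume = batch_v1.Volume(
            device_name="local-ssd-0",         # must match AttachedDisk.device_name
            mount_path="/mnt/disks/local-ssd",
        )
        runnable = batch_v1.Runnable(
            script=batch_v1.Runnable.Script(text="df -h /mnt/disks/local-ssd"),
        )
        return batch_v1.TaskSpec(runnables=[runnable], volumes=[volume])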

Is this something you've run into before? I've been comparing the output from some test runs, and the only difference I can see is that on the a3-highgpu family the boot disk comes up as /dev/nvme0n1, whereas on the a2-ultragpu family it comes up as /dev/sda. Maybe that is the problem: the Local SSDs are all exposed as nvmeXn1 devices, so perhaps the NVMe boot disk trips up whatever does the RAID and mounting in the background.
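To test that theory, I've been thinking of printing each NVMe device's model string from inside a task, since as far as I understand GCE Local SSDs usually report a different model ("nvme_card") than NVMe-attached persistent/boot disks ("nvme_card-pd"); something like:

    import pathlib

    # Quick diagnostic sketch: list every NVMe block device and its model string.
    # On GCE, Local SSDs typically report "nvme_card" and NVMe persistent disks
    # report "nvme_card-pd", which should separate the boot disk from the Local SSDs.
    for dev in sorted(pathlib.Path("/sys/block").glob("nvme*")):
        model = (dev / "device" / "model").read_text().strip()
        print(f"/dev/{dev.name}: {model}")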
