I have a training job (written in PyTorch) that loads sharded CSV files from GCS using the TorchData library.
However, I've noticed that GPU utilisation drops to 0% for about 2-3 minutes after each epoch, which I presume is because streaming data from GCS is I/O-bound and starves the GPU.
What's the most efficient way around this? Would it be to download all my files from GCS to my compute instance and then load the data from local disk?
Good day @shengy90,
Welcome to Google Cloud Community!
You can try the following suggestions:
1. Try WebDataset. It packs multiple data files into POSIX tar archives (shards). No format conversion is involved: the data inside the tar file is in the same format as on disk, and the archives can be created with the plain tar command. Because the files inside a shard are read one after another, WebDataset gives you sequential I/O. That helps here because your data lives remotely in GCS: sequential reads of objects over the network are faster and reduce potential bottlenecks. See this blog post for more information on WebDataset: https://cloud.google.com/blog/topics/developers-practitioners/scaling-deep-learning-workloads-pytorc...
2. Use Cloud Storage FUSE, which is how Vertex AI training accesses data on Cloud Storage. It lets you read a Cloud Storage bucket as if it were a local file system, with high throughput, using ordinary file operations such as:
file = open('/gcs/bucket-name/object-path', 'r')
You can use this link to learn more: https://cloud.google.com/vertex-ai/docs/training/code-requirements#fuse
3. For best performance, your bucket should be in the same region where you run the custom training job. You can learn more here: https://cloud.google.com/vertex-ai/docs/training/code-requirements#loading-data
4. You can also use this blog post as a guide to training efficiently with Vertex AI; it includes a demonstration: https://cloud.google.com/blog/products/ai-machine-learning/efficient-pytorch-training-with-vertex-ai
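To make the idea behind suggestion 1 concrete, here is a minimal sketch of packing CSV shards into a tar archive and reading them back in order. It uses only Python's standard `tarfile` and `csv` modules, not the WebDataset library itself, and the function names `pack_csv_shards` and `iter_rows` are made up for illustration. It assumes each CSV shard is UTF-8 and small enough to hold in memory.

```python
import csv
import io
import tarfile
from pathlib import Path


def pack_csv_shards(csv_paths, tar_path):
    """Pack CSV files into one uncompressed tar archive, bytes unchanged."""
    with tarfile.open(tar_path, "w") as tar:
        for path in csv_paths:
            # Store each file under its base name; the content is not converted.
            tar.add(path, arcname=Path(path).name)


def iter_rows(tar_path):
    """Yield CSV rows from every member of the archive, in archive order."""
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Read one whole member at a time (assumes a shard fits in memory).
            data = tar.extractfile(member).read().decode("utf-8")
            yield from csv.reader(io.StringIO(data, newline=""))
```

With a hypothetical archive such as `shards.tar` built this way, `for row in iter_rows("shards.tar")` walks every row with one sequential pass over a single file. In a real training job you would more likely point the WebDataset library at the tar shards and feed it into your PyTorch DataLoader; this sketch only shows the storage layout and access pattern.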
Hope this helps!