I currently have a training task that loads sharded CSV files from GCS using the TorchData library (the training code is in PyTorch).
However, I notice that my GPU sits at roughly 0% utilisation for about 2-3 minutes after each epoch, which I presume is caused by I/O bottlenecks while streaming data from GCS, starving the GPU.
What's the most efficient way of getting around this? Should I download all my files from GCS to my compute instance and then load the data directly from local disk?
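For context, my input pipeline currently looks roughly like the sketch below (the bucket and shard names are placeholders, not my real paths); it streams the CSV shards straight from GCS via fsspec/gcsfs:

```python
# Rough sketch of the current pipeline: stream sharded CSVs from GCS with TorchData.
# Requires gcsfs to be installed so fsspec can open gs:// URLs.
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Placeholder shard URLs.
shard_urls = [f"gs://my-bucket/train/shard-{i:05d}.csv" for i in range(100)]

datapipe = (
    IterableWrapper(shard_urls)
    .open_files_by_fsspec(mode="rt")   # open each shard as a text stream over the network
    .parse_csv(skip_lines=1)           # yield one parsed row (a list of strings) at a time
)

loader = DataLoader(datapipe, batch_size=256, num_workers=4)
```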
Good day @shengy90,
Welcome to Google Cloud Community!
You can try the following suggestions:
1. Try implementing WebDataset. It shards and compiles multiple data files into POSIX tar archives; it doesn't do any format conversion, so the data format inside the tar file is the same as it is on disk, and the archives can be created with the tar command. WebDataset is a great way to achieve sequential I/O, since it reads the individual files inside the tar one after another. This is helpful because your data sits in GCS, i.e. in a remote setting: it provides faster I/O of objects over the network and reduces potential bottlenecks. A minimal sketch of reading such shards in PyTorch is included after this list. You can check this blog post for more information on WebDataset: https://cloud.google.com/blog/topics/developers-practitioners/scaling-deep-learning-workloads-pytorc...
2. Cloud Storage FUSE can be used to access data on Cloud Storage for Vertex AI training. It lets you access Cloud Storage as a local file system with high throughput, simply by using code like:
file = open('/gcs/bucket-name/object-path', 'r')
You can use this link to learn more: https://cloud.google.com/vertex-ai/docs/training/code-requirements#fuse
3. Also, for best performance, your bucket should be in the same region as the one where you are performing the custom training. You can use this link to learn more: https://cloud.google.com/vertex-ai/docs/training/code-requirements#loading-data
4. You can also use this blog post as a guide on how to train efficiently with Vertex AI; you can check the demonstration at this link: https://cloud.google.com/blog/products/ai-machine-learning/efficient-pytorch-training-with-vertex-ai
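For illustration only, here is a minimal sketch of what reading WebDataset shards stored in GCS could look like in PyTorch. The bucket name, shard pattern, and sample keys are placeholders you would replace with your own; it assumes the webdataset package is installed and gsutil is available on the training machine.

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Placeholder shard pattern; brace expansion covers shard-000000.tar .. shard-000099.tar.
# The "pipe:" URL streams each tar straight from GCS instead of downloading it first.
shards = "pipe:gsutil cat gs://your-bucket/shards/shard-{000000..000099}.tar"

dataset = (
    wds.WebDataset(shards)
    .decode()                                # decode entries based on their file extensions
    .to_tuple("features.npy", "label.cls")   # placeholder keys inside each sample
)

# IterableDataset, so the DataLoader only batches; shuffling would be done in the pipeline.
loader = DataLoader(dataset, batch_size=32, num_workers=2)

for features, labels in loader:
    ...  # training step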
Hope this helps!
Thank you! I think that's what I'm looking for 👍
Hi @kvandres
I'm just looking into WebDataset and something isn't quite clear to me. When I create a managed dataset via Vertex AI, Vertex AI clones my source data into a Vertex AI staging bucket and automatically shards all the files into small CSV files (see screenshot below).
As far as I understand, WebDataset requires the files to be in .tar format, which isn't compatible with what I'm seeing here. Would I need to:
1) Create an intermediate job to convert all these CSV files into tar format before using WebDataset (a rough sketch of what I have in mind is below, after these questions)?
2) Is there a way for Vertex AI to output these files as tar files instead of CSV? From what I see in the GCP docs, that seems impossible, as the data must be in CSV, JSON or Avro format?
3) Should I ignore managed datasets entirely for now and create my own data processing job that takes my source data and packages it as tar files before calling my training job?
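For option 1, here's a rough sketch of the kind of intermediate conversion job I have in mind; the directory, shard names, and bucket are just placeholders, and it assumes the Vertex AI CSV shards have already been copied locally (e.g. with gsutil cp):

```python
# Sketch of an intermediate job (option 1): pack the sharded CSV files into a
# POSIX tar archive that WebDataset can read. No format conversion is done;
# each CSV shard is stored as-is inside the tar.
import glob
import os
import tarfile

csv_shards = sorted(glob.glob("csv_shards/*.csv"))   # placeholder local directory

with tarfile.open("train-000000.tar", "w") as tar:
    for path in csv_shards:
        # arcname keeps just the file name, so WebDataset keys samples by basename.
        tar.add(path, arcname=os.path.basename(path))

# The resulting tar could then be uploaded back to GCS, e.g.:
#   gsutil cp train-000000.tar gs://your-bucket/webdataset/
```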