
Datasets best practices

I am new to ML and Vertex AI. I have some questions about an app I am building that requires image classification labels. The closest example I can think of is a mobile app that identifies plants, like PlantNet: you take a photo, and it returns the type of plant, ideally with a relationship to its parent species.

I chose Vertex AI because it integrates with Cloud Storage buckets, allows custom labels, and supports more than one label per image. I plan to have a single endpoint to query against, across all my data.

In terms of what should go inside a dataset, are there best practices for setting up datasets in Vertex AI? Should I put all images (categories) in the same dataset, or can I add multiple datasets that are queried through a single endpoint?

  • Should there be 1 dataset, or multiple?

    • e.g., a separate dataset for trees and a dataset for flowers?

      • In this case, "trees" would include photos labeled "oak", "pine", and "maple", plus a `none_of_these` label applied to things like "roses", "poison ivy", and "grass".

    • Or a single large dataset that includes all the labels for everything?

  • What about model deployment? How can I set a budget on that? It's darn pricey at 1.375 USD per hour.

  • What about training hours? Is that a bit more ambiguous because it's based on the training output ratings?

    • It's also pricey, at 3.465 USD per hour.

1 ACCEPTED SOLUTION

Good day @lucksp,

Welcome to Google Cloud Community!

This will depend on your use case; here are some suggestions for your questions:

1. You can use a single dataset with all the categories if the categories you're working with are closely related and you want your model to distinguish between them. In your example, a single dataset containing all of these labels will let the model learn the variations among all the categories and produce more accurate predictions. Creating two separate datasets, each with its own model, makes more sense when the tasks are unrelated, or when you want to keep things modular and maintain different models for different tasks. A minimal sketch of the single-dataset approach is shown below.
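For illustration, here is a rough sketch of creating a single multi-label image dataset with the Vertex AI Python SDK (`google-cloud-aiplatform`). The project ID, bucket path, and import file name are placeholders, not values from this thread; the JSONL layout in the comment is the standard image classification import format.

```python
# pip install google-cloud-aiplatform
from google.cloud import aiplatform

# Placeholder project and region -- substitute your own.
aiplatform.init(project="my-project", location="us-central1")

# Each line of the JSONL import file pairs one image with one or more labels, e.g.:
# {"imageGcsUri": "gs://my-bucket/plants/img001.jpg",
#  "classificationAnnotations": [{"displayName": "oak"}, {"displayName": "tree"}]}
dataset = aiplatform.ImageDataset.create(
    display_name="plants",
    gcs_source="gs://my-bucket/plants/import.jsonl",
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.multi_label_classification,
)
print(dataset.resource_name)
```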

2. You can limit the number of compute nodes in your model settings during deployment, but this will also cap the endpoint's throughput. You can check this link for more information: https://cloud.google.com/vertex-ai/docs/tutorials/image-recognition-automl/deploy-predict
You can also track the ongoing feature request for autoscaling to zero: https://issuetracker.google.com/206042974
If you want to learn more about the considerations when deploying a model, you can check this link: https://cloud.google.com/vertex-ai/docs/general/deployment
A minimal deployment sketch follows these links.
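As a rough illustration (the model resource name is a placeholder), pinning both the minimum and maximum replica count to 1 when deploying keeps the hourly node cost bounded, since autoscaling can never add nodes:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder model resource name -- look yours up in the console or via Model.list().
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# min_replica_count == max_replica_count == 1 prevents autoscaling from adding
# nodes, so the endpoint never bills for more than one node-hour per hour.
endpoint = model.deploy(
    deployed_model_display_name="plants-classifier",
    min_replica_count=1,
    max_replica_count=1,
)
```

While the autoscale-to-zero feature request is still open, the only way to stop the hourly charge entirely is to undeploy the model when you are not serving traffic (for example, with `endpoint.undeploy_all()`).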

3. You can configure the training budget for a training job, but please note that pricing is based on node hours. You can also enable early stopping; disabling early stopping will train the model until your training budget is exhausted. It is also important to know that actual training time will still vary:

Model training can take many hours, depending on the size and complexity of your data and your training budget, if you specified one. You can use this link for more information: https://cloud.google.com/vertex-ai/docs/tabular-data/forecasting/train-model

You can use this link for more information about training pricing: https://cloud.google.com/vertex-ai/pricing#automl_models. A sketch of setting a budget through the SDK follows.
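As an illustrative sketch (display names and the budget value are placeholders), an AutoML image classification training job created through the Python SDK takes its budget in milli node hours, so 8000 means at most 8 node hours; with early stopping left enabled, training may finish, and stop billing, sooner:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.AutoMLImageTrainingJob(
    display_name="plants-training",
    prediction_type="classification",
    multi_label=True,  # allows more than one label per image
)

# budget_milli_node_hours=8000 caps spend at 8 node hours; with early stopping
# enabled (disable_early_stopping=False), training can stop early and bill
# only for the node hours actually used.
model = job.run(
    dataset=dataset,  # the ImageDataset created in the earlier sketch
    budget_milli_node_hours=8000,
    disable_early_stopping=False,
)
```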

For best practices on creating training data, which will help increase the quality of the model, you can check this link for more information: https://cloud.google.com/vertex-ai/docs/tabular-data/bp-tabular

Hope this helps!
