AutoML Image classifier taking days for training

LouisPetrik · 02-16-2025 05:08 AM

I started training an AutoML image classifier on Vertex AI. The dataset is only about 36 images of dogs and flowers. I am doing this for my bachelor's thesis.
However, the model doesn't stop training. It has been running for 11 days now, while it took only a couple of seconds to train a TensorFlow NN on these images on my machine. What can I do about it? I need to finish this quickly, as I want to examine XAI features and model deployment with it.

MarvinLlamas

Hi @LouisPetrik,

Welcome to Google Cloud Community!

It looks like you are running into a significant issue with your AutoML image classification training job on Vertex AI. The job has been running for an unreasonably long time (11+ days) on a very small dataset (36 images), even though its budget is only 2 node hours. This points to a bug or misconfiguration in the training process, and it is blocking your thesis work on XAI and model deployment.
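To put that mismatch in numbers, here is a rough sketch that compares the elapsed wall-clock time with the node-hour budget. It assumes a single training node, so that node hours roughly equal wall-clock hours; that is a simplification for illustration, not a statement about how Vertex AI schedules AutoML training. The function name `overrun_factor` is made up for this example.

```python
from datetime import timedelta


def overrun_factor(elapsed: timedelta, budget_node_hours: float, nodes: int = 1) -> float:
    """Return how many times the elapsed time exceeds the budgeted time.

    Assumption: `nodes` training nodes run in parallel, so the budget
    lasts budget_node_hours / nodes wall-clock hours.
    """
    if budget_node_hours <= 0 or nodes <= 0:
        raise ValueError("budget_node_hours and nodes must be positive")
    budget_hours = budget_node_hours / nodes
    elapsed_hours = elapsed.total_seconds() / 3600
    return elapsed_hours / budget_hours


# Figures from this thread: 11 days of runtime against a 2 node-hour budget.
factor = overrun_factor(timedelta(days=11), budget_node_hours=2)
print(f"Elapsed time is {factor:.0f}x the budgeted time")
```

Under that single-node assumption, 11 days is 264 hours, which is 132 times the 2 node-hour budget, so the job is very unlikely to still be doing useful work.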

Here are some steps that might help:

Stop the training job: I suggest stopping the training job immediately to avoid further costs. In the Vertex AI console, go to the training pipelines, select the running pipeline (identified by the Pipeline ID in your screenshot), and cancel it.
Set up budget alerts: Consider setting up budget alerts in your project to catch unexpected cost overruns early. You can configure them to trigger at 50%, 75%, 90%, and 100% of your budget and send email notifications so that you can act immediately.
Check the logs: After stopping the job, look for the model training logs if they are available, as they may help you debug the pipeline. It is possible that your model training crashed a long time ago but Vertex AI still reports it as running.
Increase the number of images: 36 images is a very small dataset. Adding more images would help with training and address problems such as too little data for splitting, overfitting, limited representation of variability, and unreliable model evaluation.
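If the console is not convenient, the first step can also be done through the Vertex AI REST API, which has a `cancel` method on training pipelines. The sketch below only builds the request; the project ID, region, pipeline ID, and token are hypothetical placeholders that you would replace with your own values, and the helper names are made up for this example.

```python
import urllib.request


def cancel_url(project: str, location: str, pipeline_id: str) -> str:
    """Build the REST URL for cancelling a Vertex AI training pipeline."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/"
        f"trainingPipelines/{pipeline_id}:cancel"
    )


def build_cancel_request(
    project: str, location: str, pipeline_id: str, access_token: str
) -> urllib.request.Request:
    """Prepare (but do not send) the authenticated POST request."""
    return urllib.request.Request(
        cancel_url(project, location, pipeline_id),
        data=b"{}",  # the cancel call takes an empty JSON body
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical placeholder values, not taken from this thread.
request = build_cancel_request("my-project", "us-central1", "1234567890", "ACCESS_TOKEN")
print(request.get_method(), request.full_url)

# To actually send it, uncomment the next line:
# urllib.request.urlopen(request)
```

One way to get an access token is `gcloud auth print-access-token`. Cancellation is best effort, so it is worth checking the pipeline state in the console afterwards.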

If you continue to run into issues, consider reaching out to Google Cloud Support so they can investigate the underlying issue. When you contact them, provide as much detail as possible and include screenshots. This will help them understand your problem and resolve it more quickly.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.