
Vertex AI Vision Model Training Fails After 30 Seconds

Hi Google Cloud Community,

I'm encountering a critical issue while training a custom Vision model on Vertex AI and would appreciate any guidance to resolve it.

I’m using a dataset of ~140,000 images (format: JPEG, size: 1024x1024) stored in a Google Cloud Storage bucket.

When initiating the training job via the Vertex AI Console, the task starts normally but consistently fails after ~30 seconds.

The error occurs at the same point every time, regardless of retries. Screenshot of the error message: Kagirohi_1-1741681727520.png

Any suggestions for debugging the abrupt failure?


Hi @Kagirohi,

Welcome to Google Cloud Community!

A Vertex AI Vision model training job that fails consistently after about 30 seconds on a dataset of 140,000 images suggests a resource bottleneck or a configuration issue.

Here are some approaches that you may try:

  • Inspect the Cloud Logging entries for the training job after a failure. This will likely give you the most direct clue.
  • Validate your images by running a Python script that checks the dataset for corrupted files.
  • Ensure that your Google Cloud project has sufficient quotas for the resources required by your training job. If quotas are exceeded, the job might fail shortly after starting.
  • Increase compute resources. Try a larger machine type with GPU/TPU acceleration. This can help with large datasets when the logs point to memory or compute exhaustion.
  • Check the configuration of the worker machines (e.g., machine type, accelerators). Misconfigured resources can lead to job failures.
  • Verify that the Google Cloud Storage bucket containing your dataset is accessible to the Vertex AI service account. Ensure the correct permissions are set.
  • If you're using custom training code, a bug in that code might be causing the job to terminate. Test the code locally or in a Vertex AI Workbench instance to identify potential problems.
  • Experiment with a smaller batch size.
  • Test with a small, representative subset of your data to isolate data-related issues.
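
Two of the suggestions above, checking for corrupted images and testing on a small subset, can be sketched with the Python standard library alone. This is only a rough illustration, and it assumes the images have first been copied from the bucket to a local directory. The function names are hypothetical and not part of any Vertex AI API. The check is a cheap heuristic: it only confirms that each file starts and ends with the JPEG start/end markers, so it catches empty or truncated files but does not prove an image decodes correctly, and it may flag valid files that carry extra bytes after the end marker.

```python
import random
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker
JPEG_SUFFIXES = {".jpg", ".jpeg"}


def looks_like_complete_jpeg(path):
    """Heuristic: the file begins with SOI and ends with EOI."""
    try:
        with Path(path).open("rb") as f:
            head = f.read(2)
            f.seek(0, 2)          # jump to the end to learn the size
            if f.tell() < 4:      # too small to hold both markers
                return False
            f.seek(-2, 2)         # last two bytes of the file
            tail = f.read(2)
    except OSError:
        # Unreadable files are treated as suspect.
        return False
    return head == JPEG_SOI and tail == JPEG_EOI


def find_suspect_images(root):
    """Return a sorted list of JPEG paths under root that fail the heuristic."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file()
        and p.suffix.lower() in JPEG_SUFFIXES
        and not looks_like_complete_jpeg(p)
    )


def sample_subset(paths, k, seed=0):
    """Pick a reproducible random subset of at most k paths."""
    ordered = sorted(paths)       # fixed order so the seed is meaningful
    rng = random.Random(seed)
    return rng.sample(ordered, min(k, len(ordered)))
```

Calling `find_suspect_images` on the local copy lists files worth inspecting or excluding, and `sample_subset` picks a few hundred paths to upload as a trial dataset. If the job succeeds on the subset but fails on the full set, that points toward a data or scale problem rather than a permissions or configuration problem.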

If the issue persists, contact Google Cloud Support. They have better visibility into the underlying system and can assist you with specific issues.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.