Hi Google Cloud Community,
I'm encountering a critical issue while training a custom Vision model on Vertex AI and would appreciate any guidance to resolve it.
I’m using a dataset of ~140,000 images (format: JPEG, size: 1024x1024) stored in a Google Cloud Storage bucket.
When initiating the training job via the Vertex AI Console, the task starts normally but consistently fails after ~30 seconds.
The error occurs at the same point every time, regardless of retries. [Screenshot of the error message]
Any suggestions for debugging the abrupt failure?
Hi @Kagirohi,
Welcome to Google Cloud Community!
A Vertex AI Vision model training job that fails consistently after about 30 seconds on a dataset of 140,000 images strongly suggests a resource bottleneck or a configuration issue.
Here are some approaches that you may try:
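One check you can run before retrying is sketched below. It is an assumption, not a confirmed cause: a job that stops at the same point within seconds may be hitting an unreadable input file. The sketch scans a locally downloaded sample of the dataset and lists files that are empty or do not start with the standard JPEG header bytes. The function names and the sample folder are hypothetical, and the code uses only the Python standard library, not any Vertex AI API.

```python
from pathlib import Path

# Every JPEG stream starts with the start-of-image marker FF D8,
# followed by the first byte (FF) of the next marker.
JPEG_SOI = b"\xff\xd8\xff"


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the bytes begin with the JPEG start-of-image marker."""
    return data[:3] == JPEG_SOI


def find_suspect_files(folder: str) -> list[str]:
    """List files under `folder` that are empty or lack a JPEG header.

    `folder` is a hypothetical local copy of a sample of the bucket.
    """
    suspects = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            header = handle.read(3)  # only the first bytes are needed
        if not looks_like_jpeg(header):
            suspects.append(str(path))
    return suspects
```

For example, after copying a few hundred images from the bucket into a local folder, `find_suspect_files("sample_images")` returns the paths worth inspecting. This only checks the header, so a file that passes can still be truncated or corrupt further in. An empty result does not rule out a data problem.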
If the issue persists, contact Google Cloud Support. They have better visibility into the underlying system and can assist you with specific issues.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.