
Internal error occurred on Vertex AutoML training

The Vertex AutoML training process sends me this message when the training finishes: "Training pipeline failed with error message: Internal error occurred. Please retry in a few minutes. If you still experience errors, contact Vertex AI." I have tried with different datasets and training processes and I always get the same message.
[Screenshot attached: Screenshot 2025-06-17 at 23.13.29.png]
3 REPLIES

Hi @jphilippi_trust,

Welcome to the Google Cloud Community!

It looks like you are encountering repeated Vertex AI AutoML training failures with a vague “Internal error,” and your actual training time (around 4 hr 23 min) exceeds your allocated budget (2 node hours). This mismatch suggests your job may be running out of resources or time during the final stages, causing Vertex AI to return a generic error instead of clearly indicating that the budget was exceeded.
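
If it helps to confirm what the pipeline actually recorded, you can pull the training pipeline’s stored error with the Vertex AI client library. This is only a minimal sketch; the project ID, region, and pipeline ID below are placeholders you would replace with the values from the pipeline’s detail page.

```python
from google.cloud import aiplatform_v1

# Regional endpoint for the region the training ran in (assumed us-central1).
client = aiplatform_v1.PipelineServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)

# Placeholder project and pipeline IDs (assumptions) -- copy the real IDs
# from the training pipeline's detail page in the Cloud Console.
name = client.training_pipeline_path(
    project="your-project-id",
    location="us-central1",
    training_pipeline="YOUR_PIPELINE_ID",
)

pipeline = client.get_training_pipeline(name=name)
print(pipeline.state)                                # e.g. PIPELINE_STATE_FAILED
print(pipeline.error.code, pipeline.error.message)   # the recorded failure detail, if any
```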

Here are some approaches that might help with your use case:

  • Review the Google Cloud Status Dashboard: Check the Google Cloud Status Dashboard for any ongoing incidents in your region (such as us-central1) that could be affecting Vertex AI, Cloud Storage, or related services. Even if AutoML isn't specifically mentioned, broader service issues can sometimes impact its performance.
  • Dig into Cloud Logging: Take a close look at the detailed log entries around the failed run to spot specific error messages, stack traces, or other hints about what went wrong during the export or evaluation phase. These deeper logs usually offer much clearer insight than the generic error shown in the UI (see the log-query sketch right after this list).
  • Significantly Increase the Training Budget: When starting a new training run, raise the budget to 24 or 48 node hours. You're only billed for the compute time you actually use, so the training and AutoML Edge post-processing can finish without being cut off (see the SDK sketch below).
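
Here is a minimal sketch of pulling recent error-level entries with the Cloud Logging client library, assuming a placeholder project ID and a rough time window around the failed run; tighten the filter once you see which entries relate to the training pipeline.

```python
from google.cloud import logging

# Placeholder project ID (assumption); use the project that ran the training.
client = logging.Client(project="your-project-id")

# Error-level entries since the day of the failed run (adjust the timestamp).
log_filter = 'severity>=ERROR AND timestamp>="2025-06-17T00:00:00Z"'

for entry in client.list_entries(
    filter_=log_filter,
    order_by="timestamp desc",  # newest entries first
    page_size=50,
):
    print(entry.timestamp, entry.severity, entry.payload)
```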

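Here is a rough sketch of relaunching the AutoML Edge training with a larger budget through the Vertex AI SDK. The dataset ID, display names, prediction type, and Edge model type are assumptions/placeholders, so adjust them to match your setup.

```python
from google.cloud import aiplatform

# Placeholder project, region, and dataset (assumptions).
aiplatform.init(project="your-project-id", location="us-central1")

dataset = aiplatform.ImageDataset(
    "projects/your-project-id/locations/us-central1/datasets/YOUR_DATASET_ID"
)

job = aiplatform.AutoMLImageTrainingJob(
    display_name="automl-edge-retry",
    prediction_type="classification",    # or "object_detection", depending on your dataset
    model_type="MOBILE_TF_VERSATILE_1",  # one of the AutoML Edge model types
)

model = job.run(
    dataset=dataset,
    budget_milli_node_hours=24_000,      # 24 node hours; you pay only for time actually used
    model_display_name="automl-edge-retry-model",
)
```
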
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.


Hi Marvin, thanks for your response. I checked the logs but did not find anything useful, only a log entry for the start of the training process. I will try your third suggestion and increase the maximum training time. When the training finishes I will post an update on this ticket.

I tried training a non-Edge model with the same dataset and it succeeded, but I have the same problem with Edge models. I increased the training time to 12 hours, but the problem is the same.