Dear all,
I am trying to fine-tune a model in Vertex AI using the custom training option. During the execution, I observed 3 issues.
1. The model is not stored in GCS despite giving the following options.
2. During training, when the model reports its progress as shown in the screenshot, each progress line is flagged as an Error even though no error message is shown. What's the error here? How do I interpret or resolve this message? There are hundreds of entries like this that don't appear to be real failures.
3. The job shows as failed at the end without specifying the reason. Is it because of all the errors reported in 2?
Please advise. Thanks for your help.
Suresh
Hi @sureshAZ,
Welcome to Google Cloud Community!
It seems your Vertex AI custom training job is failing silently, labeling log lines as “ERROR” when they are not errors, and not saving the model to Google Cloud Storage. The likely cause is a bug in your training script (run_module.py) that crashes the job before the model is saved.
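On the “ERROR” labels specifically: one common cause (an assumption here, since the screenshot and script aren't visible) is that the progress output goes to stderr, which the log viewer tends to show with ERROR severity. Python's `logging` module and many progress bars write to stderr by default. A minimal sketch of routing log output to stdout instead, where `configure_stdout_logging` is a hypothetical helper name and the progress line is made up:

```python
import logging
import sys


def configure_stdout_logging(level: int = logging.INFO) -> logging.Logger:
    """Send root-logger records to stdout instead of the default stderr."""
    root = logging.getLogger()
    root.setLevel(level)
    # Drop any handlers already attached; the default one writes to stderr.
    for existing in list(root.handlers):
        root.removeHandler(existing)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    root.addHandler(handler)
    return root


logger = configure_stdout_logging()
logger.info("epoch 1/3 - loss 0.42")  # hypothetical progress line

# If the script uses tqdm for progress bars, tqdm accepts a `file`
# argument, e.g. tqdm(iterable, file=sys.stdout).
```

If the lines still show as errors after this, the output is probably coming from a library that writes to stderr directly, and the labels can likely be ignored as long as the job itself succeeds.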
Here are some potential ways to address your issue:
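For the model not appearing in GCS, it may help to check where the script actually writes its output. When a base output directory is configured for the job, Vertex AI sets the `AIP_MODEL_DIR` environment variable to a `gs://` URI. Plain file APIs cannot write to a `gs://` URI, but custom training containers expose buckets through a Cloud Storage FUSE mount under `/gcs/`. A rough sketch of resolving a writable path, where `local_model_dir` and the `model_output` fallback are hypothetical names and it is assumed that your script saves with ordinary file APIs:

```python
import os


def local_model_dir(default: str = "model_output") -> str:
    """Return a filesystem path the training script can save the model to.

    Reads AIP_MODEL_DIR and, if it is a gs:// URI, rewrites it to the
    corresponding /gcs/ Cloud Storage FUSE path. Falls back to `default`
    (a hypothetical local directory) when the variable is not set.
    """
    uri = os.environ.get("AIP_MODEL_DIR", default)
    if uri.startswith("gs://"):
        return "/gcs/" + uri[len("gs://"):]
    return uri
```

The script would then pass `local_model_dir()` to whatever save call it already uses. If the job crashes before reaching that call, nothing will be written regardless of the path, so the crash needs to be fixed first.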
You can refer to the following documentation to understand the components of your setup and gather the necessary information for potential solutions to the issue.
I hope the above information is helpful.
I have exactly the same issue: every progress update is shown as an error, but Vertex AI never says what the error is. In my case, though, training finishes and the trained model is saved successfully. How can I download the full job logs from Vertex AI? @MarvinLlamas
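On downloading the full logs: one possible approach is to pull the entries from Cloud Logging with the `google-cloud-logging` Python client. This is a sketch, not an official recipe. It assumes the job's log entries use the `ml_job` resource type with a `job_id` label (worth verifying against the filter shown when you open the job's logs in the console), and `build_job_log_filter` and `download_job_logs` are hypothetical helper names.

```python
import json


def build_job_log_filter(job_id: str) -> str:
    """Build a Cloud Logging filter for one custom job.

    Assumption: entries carry resource.type="ml_job" and a job_id label.
    """
    return f'resource.type="ml_job" AND resource.labels.job_id="{job_id}"'


def download_job_logs(project_id: str, job_id: str, out_path: str) -> int:
    """Write every matching log entry to out_path as JSON lines.

    Returns the number of entries written.
    """
    # Imported here so the filter helper works without the client installed.
    # Requires: pip install google-cloud-logging
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project=project_id)
    entries = client.list_entries(
        filter_=build_job_log_filter(job_id),
        order_by="timestamp asc",
    )
    count = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for entry in entries:
            record = {
                "timestamp": entry.timestamp.isoformat() if entry.timestamp else None,
                "severity": entry.severity,
                "payload": entry.payload,
            }
            fh.write(json.dumps(record, default=str) + "\n")
            count += 1
    return count
```

Calling `download_job_logs("my-project", "1234567890", "job_logs.jsonl")` with your own project and numeric job ID (both placeholders here) would write one JSON object per log line, including the severity each line was tagged with.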