Hi!
I am trying to train a TensorFlow model with the Vertex AI Custom Training service. Training seems to run smoothly, but after about 72 hours I get an error saying "Internal error occurred for the current attempt," and the job restarts the training application. I have confirmed that it does train and does save a model (I added code that saves the model whenever accuracy improves over the previous epoch).
Normally, when my training app hits an error or bug, it logs the error, but in this case the only thing logged is the message above.
Previously, I was using a dedicated VM instance, and the training application ran with no issues. However, I decided to move to Vertex AI Training and changed the code to save artifacts to GCS instead of the VM's local disk.
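For context, the save-on-improvement logic mentioned above behaves roughly like the simplified sketch below. This is not my actual training code: the function name `epochs_to_save` is made up for illustration, the accuracy list stands in for real per-epoch evaluation, and I am assuming the first epoch always triggers a save. In the real app, the point where an epoch is recorded is where the model gets written to GCS.

```python
def epochs_to_save(accuracies):
    """Return the epoch indices at which a model would be saved.

    Hypothetical helper for illustration only. A save is triggered when
    the accuracy is higher than in the previous epoch; the first epoch
    is assumed to always save, since there is nothing to compare against.
    """
    saved = []
    previous = None
    for epoch, accuracy in enumerate(accuracies):
        if previous is None or accuracy > previous:
            # The real application would write the model to GCS here.
            saved.append(epoch)
        previous = accuracy
    return saved


if __name__ == "__main__":
    # Made-up accuracies, one per epoch.
    print(epochs_to_save([0.50, 0.60, 0.55, 0.58, 0.58]))
```

Note that this compares against the previous epoch only, not against the best accuracy seen so far, so a save can happen even when an earlier epoch was better.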
I am quite stuck, as there is nothing that indicates why I am hitting this internal error. Any advice or help would be appreciated. Thank you!
{
  "textPayload": "Internal error occurred for the current attempt.",
  "insertId": "1jht5q9d24ef",
  "resource": {
    "type": "ml_job",
    "labels": {
      "task_name": "service",
      "project_id": "vifive-cloud",
      "job_id": "1229305951777980416"
    }
  },
  "timestamp": "2024-02-19T14:14:16.741867419Z",
  "severity": "ERROR",
  "labels": {
    "ml.googleapis.com/endpoint": ""
  },
  "logName": "projects/vifive-cloud/logs/ml.googleapis.com%2F1229305951777980416",
  "receiveTimestamp": "2024-02-19T14:14:18.088843958Z"
}