
Getting "Internal error occurred for the current attempt" during Vertex AI Custom Training

sj3014
New Member

Hi!

Currently, I am trying to train a TensorFlow model with the Vertex AI Custom Training service. The training seems to be working smoothly, but after about 72 hours I get an error saying "Internal error occurred for the current attempt," and the job restarts the training application. I have confirmed that it does train and save a model (I added code that saves the model whenever accuracy improves over the previous epoch).
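
The save-on-improvement part looks roughly like this (a minimal sketch, not my exact code; it assumes TF 2.x tf.keras, and the toy model, data, and GCS path are placeholders):

import tensorflow as tf

# Minimal sketch of the save-on-improvement behaviour described above.
# The toy model/data and the GCS path are placeholders; TF 2.x tf.keras can
# write SavedModel artifacts directly to gs:// paths when GCS credentials
# are available.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# save_best_only=True writes the model only when the monitored metric improves
# over the best value seen in previous epochs.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="gs://my-bucket/vertex-training/best_model",  # hypothetical bucket
    monitor="accuracy",
    mode="max",
    save_best_only=True,
)

x = tf.random.normal((256, 10))
y = tf.cast(tf.random.uniform((256, 1)) > 0.5, tf.float32)
model.fit(x, y, validation_split=0.2, epochs=5, callbacks=[checkpoint_cb])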

Normally, if my training app has an error/bug, it does log an error, but in this case, it only logs the above message.

Previously, I was using a dedicated VM instance and the training application ran with no issues. However, I decided to move to Vertex AI Training and changed the code to save artifacts to GCS instead of the local VM disk.
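
The GCS change amounts to something like the following (again a minimal sketch; Vertex AI sets AIP_MODEL_DIR to a gs:// URI when the custom job is configured with a base output directory, and the fallback path and toy model are placeholders):

import os
import tensorflow as tf

# Sketch of writing artifacts to GCS instead of local disk. Vertex AI sets
# AIP_MODEL_DIR to a gs:// URI when the custom job has a base output directory
# configured; the local fallback below is a placeholder for running the same
# script outside Vertex.
model_dir = os.environ.get("AIP_MODEL_DIR", "/tmp/model")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="adam", loss="mse")
model.fit(tf.random.normal((64, 8)), tf.random.normal((64, 1)), epochs=1, verbose=0)

# TF 2.x can write the SavedModel directly to the gs:// location.
model.save(model_dir)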

I am quite stuck as there is no clue on why I am encountering this internal error. Any advice or help would be appreciated. Thank you!

{
  "textPayload": "Internal error occurred for the current attempt.",
  "insertId": "1jht5q9d24ef",
  "resource": {
    "type": "ml_job",
    "labels": {
      "task_name": "service",
      "project_id": "vifive-cloud",
      "job_id": "1229305951777980416"
    }
  },
  "timestamp": "2024-02-19T14:14:16.741867419Z",
  "severity": "ERROR",
  "labels": {
    "ml.googleapis.com/endpoint": ""
  },
  "logName": "projects/vifive-cloud/logs/ml.googleapis.com%2F1229305951777980416",
  "receiveTimestamp": "2024-02-19T14:14:18.088843958Z"
}

3 REPLIES

Hi @sj3014

Welcome and thank you for reaching out to our community.

I understand that the "internal error" message you received is pretty generic and won't help much with troubleshooting.

Since you moved from a local VM to GCS, please check your Cloud Storage logs for any errors around the time the training job fails. Look for entries related to an "upload failure" or "access denied", as those are the most likely causes of the issue.
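
For example, you can pull those entries with the google-cloud-logging Python client (a minimal sketch; it assumes Data Access audit logs are enabled on the bucket so object-level errors are recorded, and the time window below simply brackets the failure timestamp from your posted log entry):

from google.cloud import logging  # pip install google-cloud-logging

# Pull GCS errors (e.g. permission or upload failures) around the time the job
# failed. Data Access audit logs must be enabled on the bucket for object-level
# entries to exist.
client = logging.Client(project="vifive-cloud")
log_filter = (
    'resource.type="gcs_bucket" '
    "severity>=ERROR "
    'timestamp>="2024-02-19T13:00:00Z" timestamp<="2024-02-19T15:00:00Z"'
)
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.severity, entry.payload)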

Aside from access concerns, resource limitations might also be the culprit. Consider tracking your training metrics, such as loss, accuracy, and resource utilization, for any spikes or dips, as these can indicate resource limitations or code issues.
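
A minimal sketch of what that tracking could look like with tf.keras and a TensorBoard callback (AIP_TENSORBOARD_LOG_DIR is set by Vertex AI only when a Vertex AI TensorBoard instance is attached to the job; the local fallback and toy model/data are placeholders):

import os
import tensorflow as tf

# Log per-epoch loss/metrics to TensorBoard so any dip or spike right before
# the failure is visible. AIP_TENSORBOARD_LOG_DIR is only set when a Vertex AI
# TensorBoard instance is attached to the job; the fallback is a placeholder.
log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR", "/tmp/tb-logs")
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq="epoch")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(
    tf.random.normal((128, 4)),
    tf.random.normal((128, 1)),
    epochs=3,
    callbacks=[tensorboard_cb],
)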


I have a very similar problem. My custom job runs for ~60-63 hours, then: "Internal error occurred for the current attempt," followed by "Received SIGTERM: 15".

The custom job itself does not crash, and Vertex runs the same job again. This, however, restarts the training from scratch for me.

How can I access info on what is causing the sigterm?
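
In the meantime, one way to make these automatic retries less costly is to resume training from checkpoints rather than starting over. A minimal sketch, assuming TF 2.11+ where tf.keras.callbacks.BackupAndRestore is no longer experimental; the GCS path and toy model are placeholders:

import tensorflow as tf

# Make fit() resumable so the automatic Vertex retry continues from the last
# completed epoch instead of restarting from scratch. BackupAndRestore writes
# temporary checkpoints to backup_dir during training and restores from them
# when the same job runs again; the backup location must survive the restart.
backup_cb = tf.keras.callbacks.BackupAndRestore(backup_dir="gs://my-bucket/backup")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="adam", loss="mse")
model.fit(
    tf.random.normal((256, 8)),
    tf.random.normal((256, 1)),
    epochs=100,
    callbacks=[backup_cb],
)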

@vonholst @sj3014 Were either of you successful in determining the issue? I started running into this recently after upgrading my custom Docker image to use a newer version of TensorFlow. I suspect it might be due to a GPU hardware issue, but it's impossible to tell given the vague internal error reported by Vertex 🙁