Vertex AI - Regression - [exit-handler-1] failure

DmitryCh · 10-18-2023 04:14 PM

Hello,

Running a pretty simple regression in Vertex AI.

Constantly getting this error message:

The DAG failed because some tasks failed. The failed tasks are: [exit-handler-1].; Job (project_id = involuted-alpha-402321, job_id = 4517773981519970304) is failed due to the above error.; Failed to handle the job: {project_number = 536991238164, job_id = 4517773981519970304}

I saw other people struggling with that too:

https://www.googlecloudcommunity.com/gc/AI-ML/The-DAG-failed-because-some-tasks-failed-The-failed-ta...

Any ideas why this is happening and how to fix it?

P.S. Here is what probably led to it:

Job = tabular-stats-and-example-gen
"The replica workerpool0-0 exited with a non-zero status of 255."
But the problems started earlier with dso_loader:

{
  "insertId": "1k2aqvyf1348nx",
  "jsonPayload": {
    "message": "2023-10-18 23:02:08.778415: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n",
    "attrs": {
      "tag": "workerpool0-0"
    },
    "levelname": "ERROR"
  },
  "resource": {
    "type": "ml_job",
    "labels": {
      "project_id": "involuted-alpha-402321",
      "job_id": "3343118774164258816",
      "task_name": "workerpool0-0"
    }
  },
  "timestamp": "2023-10-18T23:02:08.778662469Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_id": "5676508830588853617",
    "ml.googleapis.com/trial_type": "",
    "ml.googleapis.com/job_id/log_area": "root",
    "compute.googleapis.com/zone": "us-central1-a",
    "ml.googleapis.com/trial_id": "",
    "compute.googleapis.com/resource_name": "cmle-training-10917961433705954519",
    "ml.googleapis.com/tpu_worker_id": ""
  },
  "logName": "projects/involuted-alpha-402321/logs/workerpool0-0",
  "receiveTimestamp": "2023-10-18T23:02:52.382179070Z"
}
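
A side note on the log entry above: the message text begins with TensorFlow's own level letter W (warning), while the entry's "severity" field says ERROR. A missing libcudart library is usually harmless on a worker without a GPU, so this entry may well not be the cause of the failure. The Python sketch below reads TensorFlow's level from the message prefix instead of trusting the "severity" field. The function name tensorflow_level and the trimmed sample entry are illustrative assumptions, not part of any Vertex AI or Cloud Logging API.

```python
import re

# TensorFlow's C++ log lines look like:
#   "<date> <time>: <level letter> <file>:<line>] <text>"
# where the level letter is I, W, E or F.
TF_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: ([IWEF]) ")

TF_LEVELS = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def tensorflow_level(entry):
    """Return TensorFlow's own level for a log entry (hypothetical helper).

    Returns None when the message does not look like a TensorFlow log line.
    """
    message = entry.get("jsonPayload", {}).get("message", "")
    match = TF_PREFIX.match(message)
    return TF_LEVELS[match.group(1)] if match else None


# Trimmed copy of the entry posted above (only the fields used here).
sample_entry = {
    "jsonPayload": {
        "message": (
            "2023-10-18 23:02:08.778415: W tensorflow/stream_executor/"
            "platform/default/dso_loader.cc:64] Could not load dynamic "
            "library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: "
            "cannot open shared object file: No such file or directory\n"
        )
    },
    "severity": "ERROR",
}

# Compare the logging severity with the level TensorFlow itself reported.
print(sample_entry["severity"], tensorflow_level(sample_entry))
# prints: ERROR WARNING
```

Entries where the two levels disagree like this are candidates to skip while looking for the real failure.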

Here is the error log >> https://drive.google.com/file/d/1GOZ7nYyUvs2bivBYVXZvqYqLQyL-sklo/

nceniza

The first error message you posted appears to be a generic one. I looked into your logs and noticed this entry: "Message: Exceeded limit 'QUOTA_FOR_INSTANCES' on resource 'resourcenameredacted'. Limit: 24.0". A possible cause is that you are hitting the instance quota limit for the region. For the time being, I would suggest trying a different region or a different machine type.
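
If it helps to check a downloaded log file for the same kind of entry, here is a minimal Python sketch that scans text lines for the quota message quoted in this reply. The regular expression is modeled only on that one message, and the names QUOTA_PATTERN, find_quota_errors and sample_lines are made up for illustration; other quota messages may be worded differently.

```python
import re

# Pattern modeled on the single message quoted in this thread (an assumption):
#   Exceeded limit '<name>' on resource '<resource>'. Limit: <number>
QUOTA_PATTERN = re.compile(
    r"Exceeded limit '(?P<limit_name>[^']+)' "
    r"on resource '(?P<resource>[^']+)'\.\s*"
    r"Limit: (?P<limit>\d+(?:\.\d+)?)"
)


def find_quota_errors(lines):
    """Return one dict per line that contains a quota-exceeded message."""
    hits = []
    for line in lines:
        match = QUOTA_PATTERN.search(line)
        if match:
            hits.append(
                {
                    "limit_name": match.group("limit_name"),
                    "resource": match.group("resource"),
                    "limit": float(match.group("limit")),
                }
            )
    return hits


# Two lines taken from this thread: one unrelated, one quota message.
sample_lines = [
    "The replica workerpool0-0 exited with a non-zero status of 255.",
    "Message: Exceeded limit 'QUOTA_FOR_INSTANCES' on resource "
    "'resourcenameredacted'. Limit: 24.0",
]

for hit in find_quota_errors(sample_lines):
    print(hit["limit_name"], hit["resource"], hit["limit"])
# prints: QUOTA_FOR_INSTANCES resourcenameredacted 24.0
```

A hit names the quota that was exceeded and its limit, which is what you would compare against the quota page for the region the job ran in.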