Hello,
Running a pretty simple regression in Vertex AI.
Constantly getting this error message:
The DAG failed because some tasks failed. The failed tasks are: [exit-handler-1].; Job (project_id = involuted-alpha-402321, job_id = 4517773981519970304) is failed due to the above error.; Failed to handle the job: {project_number = 536991238164, job_id = 4517773981519970304}
I saw other people struggling with that too:
Any ideas why this is happening and how to fix it?
P.S. Here is what probably led to it:
{
  "insertId": "1k2aqvyf1348nx",
  "jsonPayload": {
    "message": "2023-10-18 23:02:08.778415: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n",
    "attrs": {
      "tag": "workerpool0-0"
    },
    "levelname": "ERROR"
  },
  "resource": {
    "type": "ml_job",
    "labels": {
      "project_id": "involuted-alpha-402321",
      "job_id": "3343118774164258816",
      "task_name": "workerpool0-0"
    }
  },
  "timestamp": "2023-10-18T23:02:08.778662469Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_id": "5676508830588853617",
    "ml.googleapis.com/trial_type": "",
    "ml.googleapis.com/job_id/log_area": "root",
    "compute.googleapis.com/zone": "us-central1-a",
    "ml.googleapis.com/trial_id": "",
    "compute.googleapis.com/resource_name": "cmle-training-10917961433705954519",
    "ml.googleapis.com/tpu_worker_id": ""
  },
  "logName": "projects/involuted-alpha-402321/logs/workerpool0-0",
  "receiveTimestamp": "2023-10-18T23:02:52.382179070Z"
}
The first error message you posted appears to be a generic one. Looking into your logs, I noticed this entry: "Message: Exceeded limit 'QUOTA_FOR_INSTANCES' on resource 'resourcenameredacted'. Limit: 24.0". A possible cause is that you are hitting the instance quota limit for the region. For the time being, I would suggest trying a different region or a different machine type.
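
If it helps to find that kind of entry among the noisier ones (like the libcudart warning), here is a minimal Python sketch that scans exported log entries for a quota message. It is only an illustration, not an official tool: the function name find_quota_errors, the regex, and the sample entries are my own assumptions, and the pattern is modeled solely on the quota message quoted above, so other quota errors may be worded differently.

```python
import re

# Pattern modeled on the quota message quoted in this thread. The wording of
# other quota errors may differ, so treat this regex as an assumption.
QUOTA_RE = re.compile(
    r"Exceeded limit '(?P<quota>[^']+)' on resource '(?P<resource>[^']+)'\.\s*"
    r"Limit: (?P<limit>[0-9.]+)"
)


def find_quota_errors(entries):
    """Return (quota, resource, limit) for each entry reporting an exceeded quota."""
    hits = []
    for entry in entries:
        # Log entries carry their text in jsonPayload.message or textPayload.
        payload = entry.get("jsonPayload") or {}
        message = payload.get("message") or entry.get("textPayload") or ""
        match = QUOTA_RE.search(message)
        if match:
            hits.append((match["quota"], match["resource"], float(match["limit"])))
    return hits


# Hypothetical sample: the libcudart warning from the question plus the
# quota message from the reply, reduced to the fields the function reads.
sample_entries = [
    {"jsonPayload": {"message": "Could not load dynamic library 'libcudart.so.11.0'; "
                                "dlerror: libcudart.so.11.0: cannot open shared object file: "
                                "No such file or directory\n"}},
    {"jsonPayload": {"message": "Message: Exceeded limit 'QUOTA_FOR_INSTANCES' on resource "
                                "'resourcenameredacted'. Limit: 24.0"}},
]

for quota, resource, limit in find_quota_errors(sample_entries):
    print(f"quota={quota} resource={resource} limit={limit}")
```

Only the second sample entry matches, so the libcudart warning is skipped and the quota line is the one reported.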