RuntimeError: Job failed with:
code: 9
message: "The DAG failed because some tasks failed. The failed tasks are: [get-wine-data].; Job (project_id = alpine-flare-414217, job_id = 6712692485287575552) is failed due to the above error.; Failed to handle the job: {project_number = 829485284325, job_id = 6712692485287575552}"
com.google.cloud.ai.platform.common.errors.AiPlatformException: code=RESOURCE_EXHAUSTED, message=The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_cpus, cause=null; Failed to create custom job for the task. Task: Project number: 829485284325, Job id: 6712692485287575552, Task id: 3645594697843343360, Task name: get-wine-data, Task state: DRIVER_SUCCEEDED, Execution name: projects/829485284325/locations/us-central1/metadataStores/default/executions/6942316102214029031; Failed to create external task or refresh its state. Task:Project number: 829485284325, Job id: 6712692485287575552, Task id: 3645594697843343360, Task name: get-wine-data, Task state: DRIVER_SUCCEEDED, Execution name: projects/829485284325/locations/us-central1/metadataStores/default/executions/6942316102214029031; Failed to handle the pipeline task. Task: Project number: 829485284325, Job id: 6712692485287575552, Task id: 3645594697843343360, Task name: get-wine-data, Task state: DRIVER_SUCCEEDED, Execution name: projects/829485284325/locations/us-central1/metadataStores/default/executions/6942316102214029031
The error message shows that the task get-wine-data failed with RESOURCE_EXHAUSTED: creating its custom job would exceed the quota metric aiplatform.googleapis.com/custom_model_training_cpus, which is the quota for custom model training CPUs.
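If you want to pull the exceeded quota metric out of a long error string like the one above, a small helper along these lines may be useful. This is only a sketch: the function name exceeded_quota_metrics is made up for illustration, and the pattern is inferred from the single log message in this thread, so other error formats may not match.

```python
import re

# Pattern inferred from the log message quoted in this thread (an assumption);
# it captures everything between "quota limits:" and ", cause=" or ";".
_QUOTA_RE = re.compile(
    r"quota metrics exceed quota limits:\s*([^;]+?)(?:,\s*cause=|;|$)"
)


def exceeded_quota_metrics(message: str) -> list[str]:
    """Return the quota metric names listed in a RESOURCE_EXHAUSTED message.

    Hypothetical helper: returns an empty list when the message does not
    contain the expected phrase.
    """
    match = _QUOTA_RE.search(message)
    if match is None:
        return []
    # Several metrics, if present, are assumed to be comma-separated.
    return [name.strip() for name in match.group(1).split(",") if name.strip()]
```

Calling it on the AiPlatformException line from the log should return the single metric name aiplatform.googleapis.com/custom_model_training_cpus, which is the quota to look up in the console.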
To resolve this issue, you can try the following steps:
Check Quota Limits: Verify the current limit and usage for custom model training CPUs in your Google Cloud project, in the region where the job runs (the log above points to us-central1). If the limit is too low, request a quota increase in the Google Cloud Console.
Optimize Resource Usage: Review your job configuration and reduce what it requests where possible. This could mean lowering the number of CPUs requested for each task or running fewer tasks at the same time.
Retry the Job: Once you've addressed any potential issues with quota limits or resource usage, you can retry the job to see if it succeeds.
Monitor Resource Usage: Continuously monitor the resource usage of your jobs to ensure they stay within the allocated quota limits.
Contact Support: If you're still encountering issues or need assistance with quota increases, you can contact Google Cloud support for further assistance.
By following these steps, you should be able to address the resource exhaustion issue and successfully run your job.
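For the "Retry the Job" step, here is a minimal sketch of retrying with exponential backoff. It only helps when the quota is temporarily used up by other running jobs, not when a single task requests more than the limit. Everything here is an assumption for illustration: retry_on_quota_error and the submit callable are hypothetical names, and the check assumes the raised RuntimeError text contains "RESOURCE_EXHAUSTED", which may not be true for every failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_quota_error(
    submit: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call submit(), retrying with exponential backoff on quota errors.

    Hypothetical helper: `submit` stands for whatever function starts your
    pipeline job and raises RuntimeError when the job fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except RuntimeError as err:
            # Assumption: quota failures mention RESOURCE_EXHAUSTED in the text.
            # Any other error, or the final attempt, is re-raised unchanged.
            if "RESOURCE_EXHAUSTED" not in str(err) or attempt == max_attempts:
                raise
            # Wait 60 s, 120 s, 240 s, ... with the default base delay.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

You would wrap your own job submission in a function with no arguments and pass it as submit. The sleep parameter is there so the waiting can be replaced in tests.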
Deepali, I am also getting the same error. I'm not sure why, since my quota shows a limit of 8 CPUs and I'm running a very basic pipeline that should not request more than one CPU.
Has the issue been resolved for you? Could you tell me what went wrong in your case?