My quota is 4096 GB and it was reached after I triggered a serverless Spark job (which failed), but the console still shows the quota as used and I cannot trigger another job due to the error "Insufficient 'DISKS_TOTAL_GB' quota. Requested 1200.0, available 46.0."
I checked all the "Disks" sections under Compute Engine, Bare Metal, etc., and I don't have any such disks there.
All I have is 5 VMs using about 100 GB total (gcloud compute disks list shows the same).
Has anyone faced this issue? Is there any way to resolve it? Please help!
The error you're encountering suggests that you've maxed out your DISKS_TOTAL_GB quota. This Compute Engine quota caps the total persistent disk capacity provisioned in a region, and it includes the disks that Dataproc Serverless provisions for your batches.
Even though your 5 VMs only account for roughly 100 GB of disk, you can still hit the limit because the quota also counts the disk space reserved by your Dataproc Serverless batches, including batches that have failed.
To address this, you have two options: you can either augment your DISKS_TOTAL_GB quota or adjust your Dataproc Serverless batches to consume less disk space.
To raise your DISKS_TOTAL_GB quota, visit the Cloud Console at https://console.cloud.google.com/ and go to the IAM & Admin > Quotas page. Here, select the Persistent Disk Standard (GB) quota for the affected region and click Edit. In the edit dialog, increase the limit and save the request.
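If you prefer the command line, the same usage and limit numbers are visible on the region resource. A minimal sketch, assuming the affected region is us-central1 and that your gcloud version supports the --flatten/--filter combination shown:

# Show usage vs. limit for the DISKS_TOTAL_GB quota in one region.
# The region name is an assumption; use the region from the error message.
gcloud compute regions describe us-central1 \
  --flatten="quotas[]" \
  --filter="quotas.metric=DISKS_TOTAL_GB" \
  --format="table(quotas.metric,quotas.usage,quotas.limit)"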
To reduce the disk space used by your Dataproc Serverless batches, you can modify the spark.dataproc.driver.disk.size and spark.dataproc.executor.disk.size properties to a lower value. These properties determine the GB disk size allocated to the driver and executor nodes of your Dataproc Serverless batches.
For instance, if you set spark.dataproc.driver.disk.size to 100g and spark.dataproc.executor.disk.size to 50g, a batch running one driver and two executors would reserve 100 + 2 × 50 = 200 GB against the quota; the total scales with the number of executors.
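These are Spark properties passed at submission time. Here is a minimal sketch of a batch submission; my_job.py, the region, and the size values are placeholders for illustration, so check the Dataproc Serverless documentation for the current minimum sizes and value format before relying on them:

# Submit a Dataproc Serverless batch with explicit driver/executor disk sizes.
# my_job.py and us-central1 are placeholders; the sizes reuse the example above
# and may be below the documented minimums, so adjust as needed.
gcloud dataproc batches submit pyspark my_job.py \
  --region=us-central1 \
  --properties=spark.dataproc.driver.disk.size=100g,spark.dataproc.executor.disk.size=50g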
Thanks Ms4446 for the info.
The problem is that I don't have a way to clean up that 4 TB of used disk space, and I can't figure out how to remove the usage (I already deleted all the batches).
I would try increasing the quota and using the properties to control disk size (I believe the minimum needed is 250 GB total), but since I will be billed for that, I still need to figure out the cleanup!
I understand your predicament. It seems like you've deleted all the batches but the disk space is still not freed up. Here are a few steps you can take:
Check for any active jobs or instances: Even though you've deleted all batches, there might be some active jobs or instances that are still running and consuming disk space. You can do this by navigating to the Cloud Console and checking the status of your jobs or instances.
Delete unused disks: If there are any unused disks, you can delete them to free up some space. You can do this by navigating to the 'Disks' section in the Cloud Console and deleting any disks that are not in use.
Check for snapshots: Sometimes, snapshots of your disks can also consume disk space. You can check for any snapshots and delete them if they are not needed.
As for the billing, you're right that increasing the quota might lead to additional costs. However, controlling the disk size using the spark.dataproc.driver.disk.size and spark.dataproc.executor.disk.size properties can help manage the costs. The minimum required might be 250 GB total, but you can adjust these values based on your specific needs and budget.
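To run through those checks from the command line, here is a sketch of the commands I would use; the region is an assumption, and the -users filter simply matches disks with no attached instance:

# Dataproc Serverless batches that may still be holding disk (adjust the region).
gcloud dataproc batches list --region=us-central1
# Persistent disks that are not attached to any instance.
gcloud compute disks list --filter="-users:*"
# Snapshots, in case any are lingering.
gcloud compute snapshots list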
Thanks for the possible areas to look at, but I had already checked them. I found that the usage is now gone; it simply took about 1 to 2 hours for the quota usage to update.
Thanks for the suggestion on the properties to use for controlling disk usage.
I'm experiencing this issue too. My quota is at its limit, and this is preventing GKE Autopilot from scheduling pods.
I have one disk shown in "gcloud compute disks list":
NAME LOCATION LOCATION_SCOPE SIZE_GB TYPE STATUS
build-865110f us-central1-c zone 30 pd-standard READY
I have no snapshots (confirmed using "gcloud compute snapshots list").
It's possible that the disks were created by GKE Autopilot: I previously attempted to add a volume to one of my pods and that pod ended up in a "failed scale-up" loop. However, I've since deleted the cluster ("gcloud container clusters delete") where this was happening, and the problem remains.
How can I tell what's using my Persistent Disk Standard quota, and how can I remove these resources?
To address your issue with the Persistent Disk Standard quota in Google Cloud, especially in the context of GKE Autopilot, here are some steps you can take:
Review Disk Usage in the GCP Console: open Compute Engine > Disks and check every zone in the affected region, not just the one your cluster ran in.
Check for Orphaned Disks: look for disks that are not attached to any instance; dynamically provisioned volumes can outlive the workloads that created them (see the sketch after this list).
Review GKE Autopilot Resources: if any cluster still exists, inspect its PersistentVolumes and PersistentVolumeClaims for volumes backed by Persistent Disk.
Use gcloud Commands: run gcloud compute disks list to list all disks in your project, and gcloud compute disks describe [DISK_NAME] to get more details about a specific disk.
Check for Hidden Resources: resources created indirectly, such as volumes provisioned by a cluster or a managed service, may not appear where you expect; check all zones and regions of the project.
Cleanup and Monitor: delete anything you no longer need, then watch the quota on the IAM & Admin > Quotas page; usage can take a while to update, as noted earlier in this thread.
Review Quota Increase: if your legitimate usage is simply close to the limit, request an increase for that region.
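For the orphaned-disk check, a sketch of what I would run; the region and the pvc- naming convention for dynamically provisioned GKE volumes are assumptions, so confirm with describe before deleting anything:

# Unattached disks in the affected region (zone~ does a regex match on the zone).
gcloud compute disks list --filter="-users:* AND zone~us-central1"
# Inspect a candidate disk; GKE-provisioned volumes typically have names like
# pvc-<uuid> and a kubernetes.io/created-for description (assumption, verify).
gcloud compute disks describe DISK_NAME --zone=us-central1-c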
Hi,
I checked all those things (disks, snapshots, GKE storage resources) in the us-central1 region (the only region where the quota was exceeded) when the problem was happening on Wednesday. The only disk shown was the 30 GB boot disk for "build-865110f" that you can see in my original message.
About 30 minutes after I posted my message, the quota started to drop, and was near 0 about 2 hours later.
I suspect that this is an issue with GKE Autopilot: I believe it repeatedly created volumes to support a pod that was in a failed scale-up loop. However, there doesn't seem to be any way to see volumes that GKE Autopilot creates using "gcloud" or the Cloud Console. Is there a way for me to check for volumes created by GKE Autopilot if the problem happens again?
The automatic scaling behavior of GKE Autopilot, coupled with limited visibility into volume creation, can indeed complicate troubleshooting.
While directly viewing GKE Autopilot-created volumes via gcloud or the Console isn't straightforward, there are several approaches and workarounds you can employ:
Monitor Cluster Logs: GKE sends cluster and Kubernetes event logs to Cloud Logging, so filter them for PersistentVolume and disk-provisioning messages around the time of the scale-up loop.
Leverage Cloud Monitoring: set up dashboards or alerts on regional disk quota usage so a runaway provisioning loop is caught early.
Check for Orphaned Resources: while the cluster exists, use kubectl get pvc --all-namespaces and kubectl describe pvc <pvc_name> to inspect all PVCs. Pay special attention to any PVCs that might be associated with the "build-865110f" disk or related to the scaling loop (see the sketch after this list).
Consider External Tools: third-party inventory or cost tools can sometimes surface resources that are hard to spot in the console.
Engage Support: if the quota usage can't be traced to any visible resource, Google Cloud Support can look up what is actually holding it.
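For the PVC check, a sketch of the commands I would use while the cluster still exists; the custom-columns paths assume the volumes are backed by Persistent Disk via either the PD CSI driver or the legacy in-tree driver:

# List every PVC in the cluster.
kubectl get pvc --all-namespaces
# Map PersistentVolumes to the underlying Compute Engine disk:
# the PD CSI driver stores it in spec.csi.volumeHandle,
# the in-tree driver in spec.gcePersistentDisk.pdName.
kubectl get pv -o custom-columns="PV:.metadata.name,CLAIM:.spec.claimRef.name,CSI_HANDLE:.spec.csi.volumeHandle,PD_NAME:.spec.gcePersistentDisk.pdName"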
@ms4446 I'm having the same problem, but with Workflows and Batch. Using the web console I cannot find any disks, and running `gcloud compute disks list` returns zero items.
I don't know what else to check or why this is happening.
I suspect this happened after I removed a batch job that had been queued for hours because of ZONE_RESOURCE_POOL_EXHAUSTED errors.
If you're encountering quota issues with no visible disks listed and suspect it might be related to Google Cloud Batch or Workflows, here are some steps to help identify and resolve the issue:
Step-by-Step Troubleshooting Guide
Check for Orphaned Batch Resources
Use the following command to list all batch jobs:
gcloud beta batch jobs list
Check the status of each job to ensure there are no lingering jobs consuming resources.
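To dig into or remove an individual job, a sketch; JOB_NAME and the location are placeholders, and depending on your gcloud version these commands may sit under the beta group like the list command above:

# Inspect one job's state and the resources it requested.
gcloud batch jobs describe JOB_NAME --location=us-central1
# Delete a job that is no longer needed.
gcloud batch jobs delete JOB_NAME --location=us-central1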
Check for Orphaned Workflow Resources
Use the following command to list all workflow executions:
gcloud workflows executions list --workflow=<workflow-name>
Check the status of each execution to ensure there are no lingering executions consuming resources.
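If an execution is still running, it can be inspected or cancelled from the CLI. A sketch with placeholder names and location:

# Show details of one execution.
gcloud workflows executions describe EXECUTION_ID --workflow=<workflow-name> --location=us-central1
# Cancel a long-running execution.
gcloud workflows executions cancel EXECUTION_ID --workflow=<workflow-name> --location=us-central1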
Inspect Resource Usage in the Quotas Page: in the Cloud Console, go to IAM & Admin > Quotas and check the current usage and limit of DISKS_TOTAL_GB for the affected region.
Use Detailed Logging and Monitoring
Use Logs Explorer to filter logs for keywords such as "disk", "volume", "provisioning", and "quota".
Example query:
resource.type="gce_disk"
logName="projects/<your-project-id>/logs/compute.googleapis.com%2Factivity_log"
textPayload:("create" OR "delete" OR "provisioning")
Check for Hidden or Transient Resources
Use gcloud to check for orphaned resources: look for lingering resources that might not be immediately visible in the console. List the relevant resources in the project:
gcloud compute disks list
gcloud compute instances list
gcloud compute snapshots list
Check for transient disk usage: Batch runs jobs on Compute Engine VMs with their own boot disks, and quota held by deleted or failed jobs can take an hour or two to be released, as others in this thread observed.
Request a Quota Increase Temporarily
If the usage doesn't clear in time and you need to run jobs right away, request a temporary increase of the DISKS_TOTAL_GB quota for the affected region from the Quotas page.