My quota is 4096 GB and it was reached after I triggered a serverless Spark job (which failed), but the console still shows the quota as used and I cannot trigger another job due to the error "Insufficient 'DISKS_TOTAL_GB' quota. Requested 1200.0, available 46.0."
I checked all the "Disks" sections under Compute Engine, Bare Metal, etc., and I don't have any such disks there.
All I have is 5 VMs using about 100 GB total (gcloud compute disks list shows the same).
Has anyone faced this issue? Is there any way to resolve it? Please help!
The error you're encountering suggests that you've maxed out your DISKS_TOTAL_GB quota. This Compute Engine quota caps the total persistent disk capacity provisioned in a region, and it includes the disks that Dataproc Serverless provisions for your batches.
Even though your 5 VMs only account for roughly 100 GB of disk, you can still hit the limit because the quota also counts the disk space reserved by your Dataproc Serverless batches, including batches that have failed.
To address this, you have two options: you can either augment your DISKS_TOTAL_GB quota or adjust your Dataproc Serverless batches to consume less disk space.
To raise your DISKS_TOTAL_GB quota, visit the Cloud Console at https://console.cloud.google.com/ and go to the IAM & Admin > Quotas page. Here, select the Persistent Disk Standard (GB) quota for the affected region and click Edit. In the edit dialog, increase the limit and save the request.
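If you prefer the command line, the same usage and limit numbers are visible on the region resource. A minimal sketch, assuming the affected region is us-central1 and that your gcloud version supports the --flatten/--filter combination shown:

# Show usage vs. limit for the DISKS_TOTAL_GB quota in one region.
# The region name is an assumption; use the region from the error message.
gcloud compute regions describe us-central1 \
  --flatten="quotas[]" \
  --filter="quotas.metric=DISKS_TOTAL_GB" \
  --format="table(quotas.metric,quotas.usage,quotas.limit)"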
To reduce the disk space used by your Dataproc Serverless batches, you can modify the spark.dataproc.driver.disk.size and spark.dataproc.executor.disk.size properties to a lower value. These properties determine the GB disk size allocated to the driver and executor nodes of your Dataproc Serverless batches.
For instance, if you set spark.dataproc.driver.disk.size to 100g and spark.dataproc.executor.disk.size to 50g, a batch running one driver and two executors would reserve 100 + 2 × 50 = 200 GB against the quota; the total scales with the number of executors.
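These are Spark properties passed at submission time. Here is a minimal sketch of a batch submission; my_job.py, the region, and the size values are placeholders for illustration, so check the Dataproc Serverless documentation for the current minimum sizes and value format before relying on them:

# Submit a Dataproc Serverless batch with explicit driver/executor disk sizes.
# my_job.py and us-central1 are placeholders; the sizes reuse the example above
# and may be below the documented minimums, so adjust as needed.
gcloud dataproc batches submit pyspark my_job.py \
  --region=us-central1 \
  --properties=spark.dataproc.driver.disk.size=100g,spark.dataproc.executor.disk.size=50g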
Thanks Ms4446 for the info.
The problem is that I don't have a way to clean up that 4 TB of used disk space, and I can't figure out how to remove the usage (I already deleted all the batches).
I would try increasing the quota and using the properties to control disk size (I believe the minimum needed is 250 GB total), but since I will be billed for that, I still need to figure out the cleanup!
I understand your predicament. It seems like you've deleted all the batches but the disk space is still not freed up. Here are a few steps you can take:
Check for any active jobs or instances: Even though you've deleted all batches, there might be some active jobs or instances that are still running and consuming disk space. You can do this by navigating to the Cloud Console and checking the status of your jobs or instances.
Delete unused disks: If there are any unused disks, you can delete them to free up some space. You can do this by navigating to the 'Disks' section in the Cloud Console and deleting any disks that are not in use.
Check for snapshots: Sometimes, snapshots of your disks can also consume disk space. You can check for any snapshots and delete them if they are not needed.
As for the billing, you're right that increasing the quota might lead to additional costs. However, controlling the disk size using the spark.dataproc.driver.disk.size and spark.dataproc.executor.disk.size properties can help manage the costs. The minimum required might be 250 GB total, but you can adjust these values based on your specific needs and budget.
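To run through those checks from the command line, here is a sketch of the commands I would use; the region is an assumption, and the -users filter simply matches disks with no attached instance:

# Dataproc Serverless batches that may still be holding disk (adjust the region).
gcloud dataproc batches list --region=us-central1
# Persistent disks that are not attached to any instance.
gcloud compute disks list --filter="-users:*"
# Snapshots, in case any are lingering.
gcloud compute snapshots list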
Thanks for the possible areas to look at, but I had already checked them. I found that the usage is now gone; it simply took about 1 to 2 hours for the quota usage to update.
Thanks for the suggestion on the properties to use for controlling disk usage.
I'm experiencing this issue too. My quota is at its limit, and this is preventing GKE Autopilot from scheduling pods.
I have one disk shown in "gcloud compute disks list":
NAME LOCATION LOCATION_SCOPE SIZE_GB TYPE STATUS
build-865110f us-central1-c zone 30 pd-standard READY
I have no snapshots (confirmed using "gcloud compute snapshots list").
It's possible that the disks were created by GKE Autopilot: I previously attempted to add a volume to one of my pods and that pod ended up in a "failed scale-up" loop. However, I've since deleted the cluster ("gcloud container clusters delete") where this was happening, and the problem remains.
How can I tell what's using my Persistent Disk Standard quota, and how can I remove these resources?
To address your issue with the Persistent Disk Standard quota in Google Cloud, especially in the context of GKE Autopilot, here are some steps you can take:
Review Disk Usage in the GCP Console: open Compute Engine > Disks and check every zone in the affected region, not just the one your cluster ran in.
Check for Orphaned Disks: look for disks that are not attached to any instance; dynamically provisioned volumes can outlive the workloads that created them (see the sketch after this list).
Review GKE Autopilot Resources: if any cluster still exists, inspect its PersistentVolumes and PersistentVolumeClaims for volumes backed by Persistent Disk.
Use gcloud Commands: run gcloud compute disks list to list all disks in your project, and gcloud compute disks describe [DISK_NAME] to get more details about a specific disk.
Check for Hidden Resources: resources created indirectly, such as volumes provisioned by a cluster or a managed service, may not appear where you expect; check all zones and regions of the project.
Cleanup and Monitor: delete anything you no longer need, then watch the quota on the IAM & Admin > Quotas page; usage can take a while to update, as noted earlier in this thread.
Review Quota Increase: if your legitimate usage is simply close to the limit, request an increase for that region.
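For the orphaned-disk check, a sketch of what I would run; the region and the pvc- naming convention for dynamically provisioned GKE volumes are assumptions, so confirm with describe before deleting anything:

# Unattached disks in the affected region (zone~ does a regex match on the zone).
gcloud compute disks list --filter="-users:* AND zone~us-central1"
# Inspect a candidate disk; GKE-provisioned volumes typically have names like
# pvc-<uuid> and a kubernetes.io/created-for description (assumption, verify).
gcloud compute disks describe DISK_NAME --zone=us-central1-c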
Hi,
I checked all those things (disks, snapshots, GKE storage resources) in the us-central1 region (the only region where the quota was exceeded) when the problem was happening on Wednesday. The only disk shown was the 30 GB boot disk for "build-865110f" that you can see in my original message.
About 30 minutes after I posted my message, the quota started to drop, and was near 0 about 2 hours later.
I suspect that this is an issue with GKE Autopilot: I believe it repeatedly created volumes to support a pod that was in a failed scale-up loop. However, there doesn't seem to be any way to see volumes that GKE Autopilot creates using "gcloud" or the Cloud Console. Is there a way for me to check for volumes created by GKE Autopilot if the problem happens again?
The automatic scaling behavior of GKE Autopilot, coupled with limited visibility into volume creation, can indeed complicate troubleshooting.
While directly viewing GKE Autopilot-created volumes via gcloud or the Console isn't straightforward, there are several approaches and workarounds you can employ:
Monitor Cluster Logs: GKE sends cluster and Kubernetes event logs to Cloud Logging, so filter them for PersistentVolume and disk-provisioning messages around the time of the scale-up loop.
Leverage Cloud Monitoring: set up dashboards or alerts on regional disk quota usage so a runaway provisioning loop is caught early.
Check for Orphaned Resources: while the cluster exists, use kubectl get pvc --all-namespaces and kubectl describe pvc <pvc_name> to inspect all PVCs. Pay special attention to any PVCs that might be associated with the "build-865110f" disk or related to the scaling loop (see the sketch after this list).
Consider External Tools: third-party inventory or cost tools can sometimes surface resources that are hard to spot in the console.
Engage Support: if the quota usage can't be traced to any visible resource, Google Cloud Support can look up what is actually holding it.
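For the PVC check, a sketch of the commands I would use while the cluster still exists; the custom-columns paths assume the volumes are backed by Persistent Disk via either the PD CSI driver or the legacy in-tree driver:

# List every PVC in the cluster.
kubectl get pvc --all-namespaces
# Map PersistentVolumes to the underlying Compute Engine disk:
# the PD CSI driver stores it in spec.csi.volumeHandle,
# the in-tree driver in spec.gcePersistentDisk.pdName.
kubectl get pv -o custom-columns="PV:.metadata.name,CLAIM:.spec.claimRef.name,CSI_HANDLE:.spec.csi.volumeHandle,PD_NAME:.spec.gcePersistentDisk.pdName"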
@ms4446 I'm having the same problem, but with Workflows and Batch. Using the web console I cannot find any disks, and running `gcloud compute disks list` returns zero items.
I don't know what else to check or why this is happening.
I suspect this happened after I removed a batch job that had been queued for hours because of ZONE_RESOURCE_POOL_EXHAUSTED errors.
If you're encountering quota issues with no visible disks listed and suspect it might be related to Google Cloud Batch or Workflows, here are some steps to help identify and resolve the issue:
Step-by-Step Troubleshooting Guide
Check for Orphaned Batch Resources
Use the following command to list all batch jobs:
gcloud beta batch jobs list
Check the status of each job to ensure there are no lingering jobs consuming resources.
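To dig into or remove an individual job, a sketch; JOB_NAME and the location are placeholders, and depending on your gcloud version these commands may sit under the beta group like the list command above:

# Inspect one job's state and the resources it requested.
gcloud batch jobs describe JOB_NAME --location=us-central1
# Delete a job that is no longer needed.
gcloud batch jobs delete JOB_NAME --location=us-central1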
Check for Orphaned Workflow Resources
Use the following command to list all workflow executions:
gcloud workflows executions list --workflow=<workflow-name>
Check the status of each execution to ensure there are no lingering executions consuming resources.
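If an execution is still running, it can be inspected or cancelled from the CLI. A sketch with placeholder names and location:

# Show details of one execution.
gcloud workflows executions describe EXECUTION_ID --workflow=<workflow-name> --location=us-central1
# Cancel a long-running execution.
gcloud workflows executions cancel EXECUTION_ID --workflow=<workflow-name> --location=us-central1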
Inspect Resource Usage in the Quotas Page: in the Cloud Console, go to IAM & Admin > Quotas and check the current usage and limit of DISKS_TOTAL_GB for the affected region.
Use Detailed Logging and Monitoring
Use Logs Explorer to filter logs for keywords such as "disk", "volume", "provisioning", and "quota".
Example query:
resource.type="gce_disk"
logName="projects/<your-project-id>/logs/compute.googleapis.com%2Factivity_log"
textPayload:("create" OR "delete" OR "provisioning")
Check for Hidden or Transient Resources
Use gcloud to check for orphaned resources: look for lingering resources that might not be immediately visible in the console. List the relevant resources in the project:
gcloud compute disks list
gcloud compute instances list
gcloud compute snapshots list
Check for transient disk usage: Batch runs jobs on Compute Engine VMs with their own boot disks, and quota held by deleted or failed jobs can take an hour or two to be released, as others in this thread observed.
Request a Quota Increase Temporarily
If the usage doesn't clear in time and you need to run jobs right away, request a temporary increase of the DISKS_TOTAL_GB quota for the affected region from the Quotas page.