
GKE Autopilot with arm64 architecture cannot schedule pod: "GCE quota exceeded"

Hi everyone,

I have an issue related to a GKE Autopilot cluster:

Currently, I have this deployment:

"""apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - env:
        - name: RABBITMQ_HOST
          value: 10.250.0.5
        - name: QUEUE_NAME
          value: testing
        image: asia-southeast1-docker.pkg.dev/project-delivery-452706/delivery-worker/dev-delivery-worker-orchestration:1
        imagePullPolicy: Always
        name: test-image-1
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
      nodeSelector:
        cloud.google.com/compute-class: Performance
        kubernetes.io/arch: arm64
      restartPolicy: Always
      terminationGracePeriodSeconds: 30"""

When I use this architecture, I encounter the following error: 'Node scale up in zones asia-southeast1-c associated with this pod failed: GCE quota exceeded. Pod is at risk of not being scheduled.'

However, when I switch to the amd64 architecture, it has resources available and can schedule the pod normally. I’ve looked through some documentation and it mentions something about quotas, but I’m not sure what the specific issue is. Could you explain what the problem might be and how I can resolve it if I want to use the arm64 architecture for my pod? Additionally, what is this quota limit? Currently, when I check, none of my quotas are set to 'Unlimited,' and some haven’t even exceeded 10% of their usage.

Is it because the resources for the ARM64 architecture are insufficient to meet the demand in the Southeast Asia region?

Thank you.


Your Google Kubernetes Engine (GKE) Autopilot cluster is unable to provision additional nodes to accommodate your ARM64-based pod due to quota limitations in the asia-southeast1-c zone. Google Cloud enforces quotas to manage resource usage and ensure fair access across projects. These quotas are applied at both the regional and zonal levels and can vary based on the specific machine types and architectures. In your case, while your overall project quotas may appear underutilized, the quotas specific to ARM64 (T2A) machine types in the asia-southeast1-c zone might be exhausted or set lower than those for AMD64 (x86) machine types.

  • Request a Quota Increase:

    • Identify Specific Quotas: Determine the exact quotas related to ARM64 (T2A) machine types in the asia-southeast1-c zone. You can do this by navigating to the Google Cloud Console Quotas page and filtering by the relevant metrics.
    • Submit a Quota Increase Request: If you identify that the quotas are indeed limited, submit a request to increase them. Detailed instructions are available in the GKE Quotas and Limits documentation on cloud.google.com.
  • Deploy in a Different Zone or Region
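
To make the quota check concrete, here is a minimal Python sketch. It assumes you have saved the output of `gcloud compute regions describe asia-southeast1 --format=json`, which includes a `quotas` list of `metric`, `limit`, and `usage` entries. The sample data, the metric names `T2A_CPUS` and `C4A_CPUS`, and the 16-vCPU node size are illustrative assumptions, not values from your project. It shows why a quota can sit at 0% usage and still block a scale-up: what matters is whether the remaining headroom covers the node that Autopilot wants to create.

```python
import json

# Illustrative sample shaped like the "quotas" field printed by
# `gcloud compute regions describe asia-southeast1 --format=json`.
# Metric names and numbers are assumptions, not the poster's real values.
SAMPLE = """
{"name": "asia-southeast1",
 "quotas": [
   {"metric": "CPUS", "limit": 24.0, "usage": 2.0},
   {"metric": "T2A_CPUS", "limit": 0.0, "usage": 0.0},
   {"metric": "C4A_CPUS", "limit": 8.0, "usage": 0.0}
 ]}
"""


def headroom(region_json, needed_vcpus, keyword="CPUS"):
    """Return (metric, remaining, fits) for each quota whose metric contains keyword."""
    region = json.loads(region_json)
    rows = []
    for quota in region.get("quotas", []):
        if keyword not in quota["metric"]:
            continue
        remaining = quota["limit"] - quota["usage"]
        # A quota with limit 0 shows 0% usage but still has no room for a node.
        rows.append((quota["metric"], remaining, remaining >= needed_vcpus))
    return rows


if __name__ == "__main__":
    # 16 vCPUs is a hypothetical node size chosen for illustration.
    for metric, remaining, fits in headroom(SAMPLE, needed_vcpus=16):
        print(f"{metric}: {remaining:g} vCPUs left, fits a 16-vCPU node: {fits}")
```

To use it on real data, replace `SAMPLE` with the contents of the saved JSON file and set `needed_vcpus` to the size of the machine type you expect GKE to provision.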

 

You're encountering a classic cloud resource quota issue, specifically related to the availability of ARM64 resources in your Google Cloud region (asia-southeast1) and zone (asia-southeast1-c). Let's break down the problem and how to resolve it:

Understanding the Problem

  1. GCE Quota Exceeded:

    • The error message "Node scale up in zones asia-southeast1-c associated with this pod failed: GCE quota exceeded" indicates that Google Kubernetes Engine (GKE) Autopilot is unable to provision the necessary ARM64 nodes to run your pod.
    • This is directly related to Google Compute Engine (GCE) quotas, which are limits on the resources your Google Cloud project can use.
  2. ARM64 Resource Availability:

    • ARM64 instances are relatively newer compared to AMD64 (x86-64) instances. Therefore, their availability and quota allocation might be more constrained, particularly in specific regions and zones.
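    • It is possible that ARM64 capacity is limited in the Southeast Asia region, especially in that specific zone. Note, however, that the "GCE quota exceeded" message points to a quota on your project rather than to a capacity shortage in the zone.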
  3. Specific Quota Limits:

    • The quota limit that's being hit is likely related to the number of ARM64 virtual CPUs (vCPUs) or the number of ARM64 instances that your project is allowed to create in the asia-southeast1-c zone.
    • It is also possible that there is a quota related to the "Performance" compute class, which you are requesting with your nodeSelector.
    • Even if your overall quotas are not at 100% usage, specific quotas for ARM64 or "Performance" compute class instances in that zone may be exhausted.
    • When dealing with GKE Autopilot, the quota issues are often related to the underlying GCE resources that GKE needs to provision.

Troubleshooting and Resolution

  1. Check Specific Quotas:

    • Go to the Google Cloud Console and navigate to "IAM & Admin" -> "Quotas."
    • Filter by "Service: Compute Engine API."
    • Filter by "Region: asia-southeast1" and "Zone: asia-southeast1-c."
    • Look for quotas related to:
      • ARM64 vCPUs
      • ARM64 instances
      • quotas related to the "Performance" compute class.
    • Compare your current usage to the quota limits.
  2. Request a Quota Increase:

    • If you find that you're hitting a quota limit, you can request a quota increase directly from the Quotas page in the Google Cloud Console.
    • Provide a clear justification for your request, explaining why you need the additional ARM64 resources.
    • Google Cloud support will review your request and approve or deny it based on resource availability and your project's history.
  3. Consider Other Zones or Regions:

    • If increasing the quota is not immediately possible, consider deploying your ARM64 pods to a different zone within the asia-southeast1 region or to a different region altogether.
    • Check the availability of ARM64 resources in other zones/regions using the Google Cloud Console or the gcloud command-line tool.
  4. Reduce Resource Requests:

    • If possible, optimize your application to use fewer resources (CPU and memory). This can help reduce the number of nodes required and potentially avoid quota issues.
    • Review the resources.requests section of your deployment manifest.
  5. Remove the nodeSelector:

    • Remove the nodeSelector from your deployment and allow GKE Autopilot to schedule your pods on any available node. This lifts the restriction to the "Performance" compute class and to arm64 nodes. If the application can run on standard nodes or amd64 nodes, the pods will then be able to schedule.
  6. Regional Cluster:

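    • Autopilot clusters are always regional, so your cluster can already place nodes in any zone of asia-southeast1 that offers the requested Arm machine type. If you run this workload on a zonal Standard cluster instead, consider switching to a regional cluster. Regional clusters distribute nodes across multiple zones, which can improve availability and reduce the risk of quota issues in a single zone.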
  7. Google Cloud Status Dashboard:

    • Check the Google Cloud Status Dashboard for any reported issues or outages in the asia-southeast1 region. This can provide insights into potential resource availability problems.
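
For the step about checking other zones, a small Python sketch along these lines may help. It assumes you have saved the output of `gcloud compute machine-types list --filter="zone~asia-southeast1" --format=json` and groups the Arm machine types by zone. The sample entries are made up for illustration, and the prefix list (`t2a-`, `c4a-`) is an assumption about which Arm series you care about.

```python
import json
from collections import defaultdict

# Name prefixes of Arm machine series; extend the tuple if you need others.
ARM_PREFIXES = ("t2a-", "c4a-")

# Made-up sample shaped like `gcloud compute machine-types list --format=json`.
SAMPLE = """
[
 {"name": "t2a-standard-4", "zone": "asia-southeast1-b", "guestCpus": 4},
 {"name": "t2a-standard-4", "zone": "asia-southeast1-c", "guestCpus": 4},
 {"name": "e2-standard-4", "zone": "asia-southeast1-a", "guestCpus": 4}
]
"""


def arm_types_by_zone(machine_types_json, region):
    """Map each zone of the region to the sorted Arm machine types listed there."""
    zones = defaultdict(list)
    for machine_type in json.loads(machine_types_json):
        # Accept either a bare zone name or a full resource URL.
        zone = machine_type["zone"].rsplit("/", 1)[-1]
        in_region = zone.startswith(region + "-")
        if in_region and machine_type["name"].startswith(ARM_PREFIXES):
            zones[zone].append(machine_type["name"])
    return {zone: sorted(names) for zone, names in sorted(zones.items())}


if __name__ == "__main__":
    for zone, names in arm_types_by_zone(SAMPLE, "asia-southeast1").items():
        print(zone, ", ".join(names))
```

A zone that appears in the result lists the machine type, but that alone does not guarantee that your project has quota or that capacity is free there, so treat it as a first filter only.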

 
