
GKE Autopilot cluster: GPU Pod (nvidia-l4 or nvidia-tesla-t4) stuck in Pending

Hello, I am stuck on a problem: I am using a GKE Autopilot cluster (version: v1.29.5-gke) and I want to deploy a GPU workload (nvidia-l4 or nvidia-tesla-t4) in the us-central1 region, but the Pod stays in the Pending state while I am using CUDA as the base image. Unfortunately, I am getting this error:

```
Normal   TriggeredScaleUp  8m1s                  cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/project-ID/zones/us-central1-c/instanceGroups/gk3-clu... 0->1 (max: 1000)}]
Warning  FailedScaleUp     7m46s                 cluster-autoscaler                     Node scale up in zones us-central1-c associated with this pod failed: Internal error. Pod is at risk of not being scheduled.
Warning  FailedScheduling  3m33s (x2 over 9m3s)  gke.io/optimize-utilization-scheduler  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
Normal   TriggeredScaleUp  2m49s                 cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/project-id/zones/us-central1-a/instanceGroups/gk3-clu... 0->1 (max: 1000)}]
Warning  FailedScaleUp     2m (x3 over 7m23s)    cluster-autoscaler                     Node scale up in zones us-central1-a associated with this pod failed: Internal error. Pod is at risk of not being scheduled.
```

Could the GCP experts please look at this problem? I need to deploy the GPU workload. Thanks

Have you checked your quota for GPUs? You might want to check the "GPUs (all regions)" quota.

What does it mean? I think they gave me unlimited usage, or should I request an increase? (screenshot attached: image (4).png)

Hi Adeel,

I think in your screenshot above we are looking at committed GPUs, not overall availability. Could you search just for "L4" and see what sort of quota you have in this region? Deploying GPUs on GKE Autopilot can sometimes be tricky, especially if you're running into node scaling and Pod scheduling issues. Here are some steps and checks to help you resolve it:

### Steps to Deploy a GPU on GKE Autopilot

1. Check GPU Quotas:
Ensure that you have sufficient GPU quotas in the `us-central1` region. You can check and increase your GPU quotas via the Google Cloud Console.

- Navigate to the [Quotas](https://console.cloud.google.com/iam-admin/quotas) page.
- Filter by `NVIDIA` to see your GPU quotas.
- Request an increase if necessary.
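
If you prefer the CLI, here is one quick sanity check (the `NVIDIA_..._GPUS` metric names follow Compute Engine's regional quota naming):

```sh
# Regional per-GPU-type quota limits and current usage:
gcloud compute regions describe us-central1 --format=yaml | grep -B 1 -A 1 GPUS
```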

2. Create a Node Pool with GPU (Standard clusters only):
GKE Autopilot manages nodes for you and does not expose user-managed node pools; on Autopilot, the GPU is requested directly in the Pod spec instead (see step 3). On a GKE Standard cluster, you would create a dedicated GPU node pool, as sketched after the bullets below:

- Navigate to the GKE cluster in the Google Cloud Console.
- Create a new node pool with the desired GPU type (e.g., NVIDIA T4).
- Make sure to select the `us-central1` region and a specific zone that supports the GPU type you want.
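
For reference, the node pool could also be created from the CLI; a sketch under the assumption that `CLUSTER_NAME` is your cluster (again, this does not apply to Autopilot, which provisions nodes for you):

```sh
# Dedicated T4 node pool on a GKE *Standard* cluster:
gcloud container node-pools create gpu-node-pool \
  --cluster=CLUSTER_NAME \
  --region=us-central1 \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=1
```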

3. Specify Node Affinity:
On a Standard cluster, include node affinity in your Pod spec so the Pod is scheduled onto the GPU node pool; on Autopilot, use a node selector instead (see the sketch after this example).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda:10.1-base
    resources:
      limits:
        nvidia.com/gpu: 1  # Request a GPU
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-nodepool
            operator: In
            values:
            - gpu-node-pool  # Ensure this matches your GPU node pool name
```
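
For completeness: on Autopilot specifically, the GKE docs use a `cloud.google.com/gke-accelerator` node selector rather than node-pool affinity, since Autopilot has no user-managed node pools. A minimal sketch (the Pod name and image tag are illustrative, not from this thread):

```sh
# Autopilot provisions a GPU node to satisfy this selector + resource limit.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-autopilot              # illustrative name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4
  containers:
  - name: gpu-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04  # any CUDA base image works
    command: ["sleep", "infinity"]              # keep the Pod alive for testing
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```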


4. Install NVIDIA Drivers:
GKE Autopilot should automatically handle the installation of NVIDIA drivers. However, you can verify that the GPU device plugin is correctly installed.

```sh
kubectl get daemonset -n kube-system nvidia-gpu-device-plugin
```

Ensure the daemonset is running and healthy.
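
It can also help to confirm that a provisioned node actually exposes the GPU as an allocatable resource (`NODE_NAME` below is a placeholder for one of your nodes):

```sh
# A healthy GPU node lists an nvidia.com/gpu entry under allocatable:
kubectl get node NODE_NAME -o yaml | grep -A 8 "allocatable:"
```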

5. Cluster Autoscaler:
On Autopilot, node autoscaling is fully managed by Google, so there is nothing for you to configure directly. `FailedScaleUp ... Internal error` events like yours often point to a quota shortfall or a temporary lack of GPU capacity in the chosen zone rather than a cluster misconfiguration.

6. Check Logs for More Information:
The error messages you're seeing (e.g., `FailedScaleUp`, `FailedScheduling`) suggest issues with node scaling and scheduling. Check the detailed logs of the cluster autoscaler and scheduler to understand why the scaling is failing.

```sh
# On GKE the cluster autoscaler runs on the Google-managed control plane, so it
# is not visible as a kube-system Deployment; read its visibility logs instead:
gcloud logging read 'logName:"cluster-autoscaler-visibility"' --limit=20
```

### Example YAML for Deployment

Here is an example deployment YAML file for deploying a pod with GPU requirements:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: gpu-container
        image: nvidia/cuda:10.1-base
        resources:
          limits:
            nvidia.com/gpu: 1
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-nodepool
                operator: In
                values:
                - gpu-node-pool
```
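
To apply this and watch the Pod's scheduling progress (the filename is illustrative):

```sh
kubectl apply -f gpu-deployment.yaml
kubectl get pods -l app=gpu-app -w   # watch until the Pod leaves Pending
```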



### Troubleshooting Tips

1. **Ensure GPU Availability in Zone:**
Confirm that the selected zone (`us-central1-a`, `us-central1-c`, etc.) has GPU availability.
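
One quick way to check this from the CLI (L4 shown; swap in `nvidia-tesla-t4` as needed):

```sh
# Zones offering the accelerator type; grep narrows the list to us-central1:
gcloud compute accelerator-types list --filter="name=nvidia-l4" | grep us-central1
```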

2. **Node Taints and Tolerations:**
Make sure to use tolerations if your GPU nodes have taints.

```yaml
tolerations:
- key: "nvidia.com/gpu"
  operator: "Equal"
  value: "present"
  effect: "NoSchedule"
```

3. **Check Cluster Configuration:**
Sometimes, cluster configurations might prevent GPU nodes from scaling up. Check for any specific policies or settings in your GKE Autopilot cluster that might restrict GPU nodes.

4. **GKE Autopilot Limitations:**
GKE Autopilot has some limitations compared to standard GKE clusters. If your workload requires specific configurations that Autopilot doesn't support, consider using a standard GKE cluster instead.

### Additional Resources

- [GKE Autopilot Documentation](https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview)
- [Using GPUs with GKE](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus)
- [Cluster Autoscaler on GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler)

By following these steps and ensuring your cluster is properly configured for GPU workloads, you should be able to successfully deploy your GPU-enabled pods on GKE Autopilot. 

Please let us know how this goes.

Thank you
Suddhasatwa

Committed means you can commit to purchasing an unlimited number of L4s, but it is not the same as the on-demand quota. You should also click on the link in my previous post, as there is also a base quota for total GPUs of all types (which defaults to 0).
