Node pool in GKE standard cluster with GPU and time-sharing enabled is not matchable by workloads

I have a node pool with GPUs attached, GPU sharing enabled with "Time-sharing" as the strategy, and "Max shared clients per GPU" set to 48. The node(s) run fine, but I'm unable to schedule workloads on them using the documented nodeSelector config for my workload, e.g.

 

nodeSelector:
  cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
  cloud.google.com/gke-max-shared-clients-per-gpu: "48"
  cloud.google.com/gke-gpu-sharing-strategy: time-sharing
 
With this, my pods get stuck in Pending with the event "x nodes didn't match Pod's node affinity/selector". If I remove the "gke-max-shared-clients-per-gpu" and "gke-gpu-sharing-strategy" key-value pairs, the pod schedules and runs fine.
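For illustration, a minimal pod along the lines of what I'm applying looks like this. It's only a sketch: the pod name, container image, and command are placeholders, while the nodeSelector block is the one above plus a standard nvidia.com/gpu request.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-timeshare-test   # placeholder name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
    cloud.google.com/gke-max-shared-clients-per-gpu: "48"
    cloud.google.com/gke-gpu-sharing-strategy: time-sharing
  containers:
  - name: cuda-test
    image: nvidia/cuda:11.0.3-base-ubuntu20.04   # placeholder image
    command: ["sleep", "infinity"]               # keeps the pod alive for testing
    resources:
      limits:
        nvidia.com/gpu: 1   # each time-shared client still requests one GPU
EOF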
 
When I check the Kubernetes labels on the nodes in the GPU time-sharing node pool, they do NOT include these labels, and I can't add them manually because GCP prevents it.
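For reference, the labels can be inspected with commands along these lines (NODE_NAME is a placeholder for one of the nodes in the pool):

# List the nodes in the GPU pool together with all of their labels
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4 --show-labels

# Or check a single node for the sharing-related labels
kubectl describe node NODE_NAME | grep -E "gke-gpu-sharing|gke-max-shared-clients"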
 
Am I missing a step here to get GPU time-sharing working?
2 Replies

Hi @some-user-921 ,

The issue you are experiencing might be caused by your GKE version. Check whether your cluster is on 1.29.3 or later, since that is the minimum GKE version required for time-sharing, which is currently only available on the Rapid channel.
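As a rough sketch, you can confirm the cluster version, release channel, and the node pool's GPU sharing configuration with something like the following (CLUSTER_NAME, LOCATION, and POOL_NAME are placeholders for your own values):

# Cluster control-plane version and release channel
gcloud container clusters describe CLUSTER_NAME --location LOCATION \
  --format="value(currentMasterVersion, releaseChannel.channel)"

# Node pool version and how its GPU sharing is configured
gcloud container node-pools describe POOL_NAME \
  --cluster CLUSTER_NAME --location LOCATION \
  --format="yaml(version, config.accelerators)"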

If the issue still persists, consider seeking assistance from Google Cloud Support.

Since you mention a node pool here, I assume you are running in GKE Standard mode?
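If you want to double-check that from the CLI, something like this should work (again, CLUSTER_NAME and LOCATION are placeholders; as far as I know it prints True for Autopilot and nothing for Standard):

# Empty output should indicate a Standard cluster; True indicates Autopilot
gcloud container clusters describe CLUSTER_NAME --location LOCATION \
  --format="value(autopilot.enabled)"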
