Failed to create cluster with GPU. No detail failure message is shown.

z165 · 11-03-2022 12:20 AM

I am creating a Standard cluster with GPU support. The command runs for more than 30 minutes and then fails without giving any reason.
I have enough quota for GPU in us-central1-c.

The command to create the Standard cluster:

$ gcloud container clusters create "cluster-007" \
--accelerator type=nvidia-tesla-t4,count=1 \
--zone "us-central1-c" \
--release-channel "stable" \
--machine-type "n1-standard-8" \
--num-nodes "1" \
--enable-autoscaling --min-nodes "1" --max-nodes "1"

The output during those 30 minutes:

Default change: VPC-native is the default mode during cluster creation for versions greater than 1.21.0-gke.1500. To create advanced routes based clusters, please pass the `--no-enable-ip-alias` flag
Default change: During creation of nodepools or autoscaling configuration changes for cluster versions greater than 1.24.1-gke.800 a default location policy is applied. For Spot and PVM it defaults to ANY, and for all other VM kinds a BALANCED policy is used. To change the default values use the `--location-policy` flag.
Note: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s).
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Creating cluster cluster-007 in us-central1-c... Cluster is being deployed...done.
ERROR: (gcloud.container.clusters.create) Operation [<Operation
clusterConditions: [<StatusCondition
canonicalCode: CanonicalCodeValueValuesEnum(DEADLINE_EXCEEDED, 4)
message: 'Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .'>]
detail: 'Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .'
endTime: '2022-11-03T07:14:47.492309591Z'
error: <Status
code: 4
details: []
message: 'Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .'>
name: 'operation-1667457432871-b9399edc'
nodepoolConditions: []
operationType: OperationTypeValueValuesEnum(CREATE_CLUSTER, 1)
progress: <OperationProgress
metrics: [<Metric
intValue: 4
name: 'CLUSTER_CONFIGURING'>, <Metric
intValue: 4
name: 'CLUSTER_CONFIGURING_TOTAL'>, <Metric
intValue: 7
name: 'CLUSTER_DEPLOYING'>, <Metric
intValue: 7
name: 'CLUSTER_DEPLOYING_TOTAL'>]
stages: []>
selfLink: 'https://container.googleapis.com/v1/projects/238291653466/zones/us-central1-c/operations/operation-1...'
startTime: '2022-11-03T06:37:12.871293265Z'
status: StatusValueValuesEnum(DONE, 3)
statusMessage: 'Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .'
targetLink: 'https://container.googleapis.com/v1/projects/238291653466/zones/us-central1-c/clusters/cluster-007'
zone: 'us-central1-c'>] finished with error: Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .
username@cloudshell:~ (project_name)$

The command was executed in the Cloud Shell terminal. I can create a Standard cluster without the GPU option.

Can someone please show me how to check the detailed error message?

ErnestoC

Since the error output you are receiving includes no underlying cause, the appropriate channel is to open a support case with Google Cloud. In a direct support case, your project can be reviewed internally.
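
As a possible first step before opening a case, the helper functions below sketch how more detail might be found from Cloud Shell. The cluster name, zone, and operation ID are copied from the question. Two things here are assumptions and may not hold: that the node pool's managed instance group name starts with gke-cluster-007-, and that the group still exists after the failed create.

```shell
#!/usr/bin/env bash
# Values copied from the question above.
ZONE="us-central1-c"
CLUSTER="cluster-007"
OPERATION="operation-1667457432871-b9399edc"

# Re-read the GKE operation record (the same data as the error dump).
describe_operation() {
  gcloud container operations describe "$OPERATION" --zone "$ZONE"
}

# List the managed instance groups backing the cluster's node pools.
# Assumption: GKE names them with a "gke-<cluster>-" prefix.
list_node_groups() {
  gcloud compute instance-groups managed list \
    --filter="name~^gke-${CLUSTER}-" --format="value(name)"
}

# Show the instance-level errors recorded by one managed instance group.
# $1 is a group name as printed by list_node_groups.
list_group_errors() {
  gcloud compute instance-groups managed list-errors "$1" --zone "$ZONE"
}
```

Running describe_operation and then list_group_errors for each name printed by list_node_groups may surface the instance-level reason (for example a resource availability or quota message), if one was recorded. If these return nothing, a support case remains the way to get the project reviewed.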
