
Failed to create cluster with GPU. No detailed failure message is shown.

 

I am creating a standard cluster with GPU support. The command runs for more than 30 minutes and then fails without giving a reason.
I have enough GPU quota in us-central1-c.

The command used to create the standard cluster:

$ gcloud container clusters create "cluster-007" \
--accelerator type=nvidia-tesla-t4,count=1 \
--zone "us-central1-c" \
--release-channel "stable" \
--machine-type "n1-standard-8" \
--num-nodes "1" \
--enable-autoscaling --min-nodes "1" --max-nodes "1"


The output during those 30 minutes:

Default change: VPC-native is the default mode during cluster creation for versions greater than 1.21.0-gke.1500. To create advanced routes based clusters, please pass the `--no-enable-ip-alias` flag
Default change: During creation of nodepools or autoscaling configuration changes for cluster versions greater than 1.24.1-gke.800 a default location policy is applied. For Spot and PVM it defaults to ANY, and for all other VM kinds a BALANCED policy is used. To change the default values use the `--location-policy` flag.
Note: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s).
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Creating cluster cluster-007 in us-central1-c... Cluster is being deployed...done.
ERROR: (gcloud.container.clusters.create) Operation [<Operation
clusterConditions: [<StatusCondition
canonicalCode: CanonicalCodeValueValuesEnum(DEADLINE_EXCEEDED, 4)
message: 'Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .'>]
detail: 'Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .'
endTime: '2022-11-03T07:14:47.492309591Z'
error: <Status
code: 4
details: []
message: 'Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .'>
name: 'operation-1667457432871-b9399edc'
nodepoolConditions: []
operationType: OperationTypeValueValuesEnum(CREATE_CLUSTER, 1)
progress: <OperationProgress
metrics: [<Metric
intValue: 4
name: 'CLUSTER_CONFIGURING'>, <Metric
intValue: 4
name: 'CLUSTER_CONFIGURING_TOTAL'>, <Metric
intValue: 7
name: 'CLUSTER_DEPLOYING'>, <Metric
intValue: 7
name: 'CLUSTER_DEPLOYING_TOTAL'>]
stages: []>
selfLink: 'https://container.googleapis.com/v1/projects/238291653466/zones/us-central1-c/operations/operation-1...'
startTime: '2022-11-03T06:37:12.871293265Z'
status: StatusValueValuesEnum(DONE, 3)
statusMessage: 'Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .'
targetLink: 'https://container.googleapis.com/v1/projects/238291653466/zones/us-central1-c/clusters/cluster-007'
zone: 'us-central1-c'>] finished with error: Deploy error: Not all instances running in IGM after 35m11.161407291s. Expected 1, running 0, transitioning 1. Current errors: .
username@cloudshell:~ (project_name)$

The command was executed in the Cloud Shell terminal. I can create a standard cluster without the GPU option.

Can someone please show me how to check the detailed error message?


Since you are receiving an error traceback with no message, the appropriate support channel would be to submit a support case with Google Cloud. In direct support cases, your project can be internally reviewed.
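
Before or alongside a support case, a few gcloud commands may surface more detail than the create command printed. The sketch below is only a starting point, not a guaranteed diagnosis. The function names are illustrative and not part of gcloud. The zone and operation ID come from the output in the question. The instance group name does not appear in that output, so it has to be looked up; the lookup assumes the node pool's managed instance group name starts with `gke-` followed by the cluster name, which is the usual GKE naming but is an assumption here.

```shell
#!/usr/bin/env bash
# Sketch: look for more detail behind "Not all instances running in IGM".
# Function names are hypothetical helpers; only the gcloud commands are real.

ZONE="us-central1-c"   # zone from the original create command

# Re-read the full operation record that the create command summarized.
describe_operation() {
  local operation_id="$1"
  gcloud container operations describe "$operation_id" --zone "$ZONE"
}

# Find the managed instance group(s) backing the cluster's node pool.
# Assumption: the group name starts with "gke-<cluster-name>".
list_node_groups() {
  local cluster_name="$1"
  gcloud compute instance-groups managed list \
    --filter="name~^gke-${cluster_name}"
}

# Show the errors the instance group recorded while trying to create VMs.
list_igm_errors() {
  local group_name="$1"
  gcloud compute instance-groups managed list-errors "$group_name" --zone "$ZONE"
}
```

A possible sequence: run `describe_operation operation-1667457432871-b9399edc`, then `list_node_groups cluster-007` to find the group name, then `list_igm_errors` with that name. If the instance group recorded why the VM could not start, it should appear in the last command's output; if nothing is listed, a support case remains the appropriate route.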
