Failed to upgrade nodepool to containerd

waynep · 05-15-2023 12:48 PM

I followed the instructions and script at https://cloud.google.com/kubernetes-engine/docs/how-to/migrate-containerd and ran:

gcloud container clusters upgrade 'xxxx-production' --project 'project-id-xxxxx' --zone 'us-xxxxx' --image-type 'COS_CONTAINERD' --node-pool 'default-pool'

This resulted in the following error message:

All nodes in node pool [default-pool] of cluster [xxxxx-production] image will change from COS to COS_CONTAINERD. This operation is long-running and will block other operations on the 
cluster (including delete) until it has run to completion.

Do you want to continue (Y/n)?  Y

Upgrading xxxx-production... Updating default-pool, done with 0 out of 3 nodes (0.0%): 1 being processed...done.                                                                      
ERROR: (gcloud.container.clusters.upgrade) Operation [<Operation
 clusterConditions: [<StatusCondition
 canonicalCode: CanonicalCodeValueValuesEnum(NOT_FOUND, 5)
 message: 'Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.'>]
 detail: 'Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.'
 endTime: '2023-05-13T01:08:09.455974926Z'
 error: <Status
 code: 5
 details: []
 message: 'Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.'>
 name: 'operation-1683938273861-...........'
 nodepoolConditions: []
 operationType: OperationTypeValueValuesEnum(UPGRADE_NODES, 4)
 progress: <OperationProgress
 metrics: [<Metric
 intValue: 3
 name: 'NODES_TOTAL'>, <Metric
 intValue: 1
 name: 'NODES_FAILED'>, <Metric
 intValue: 0
 name: 'NODES_COMPLETE'>, <Metric
 intValue: 1
 name: 'NODES_DONE'>, <Metric
 intValue: 0
 name: 'NODE_PDB_DELAY_SECONDS'>]
 stages: []>
 selfLink: 'https://container.googleapis.com/v1/projects/..........'
 startTime: '2023-05-13T00:37:53.861775789Z'
 status: StatusValueValuesEnum(DONE, 3)
 statusMessage: 'Google Compute Engine: Managed instance gke-xxxx-default-pool-4a9ae595-tuog not found.'
 targetLink: 'https://container.googleapis.com/v1/projects/....'
 zone: 'us-xxx'>] finished with error: Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.

Any ideas on how to debug this? The same procedure worked flawlessly on both of our staging clusters.

Willbin

Hello @waynep,

Welcome to Google Cloud Community!

Can you confirm the version of the cluster you are currently using? Thanks

waynep

The cluster is on 1.23.17-gke.1700.

The node pool is on 1.23.16-gke.200.

Perhaps the node pool should be upgraded to match the cluster first?
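For anyone wanting to check the same skew from the command line, here is a minimal sketch. The cluster, pool, zone, and project values are the redacted placeholders from this thread, and the function names are made up for illustration. It only reports the two versions; it does not establish whether the skew is related to the failure.

```shell
# Placeholder values taken from the thread; substitute your own.
CLUSTER="xxxx-production"
POOL="default-pool"
ZONE="us-xxxxx"
PROJECT="project-id-xxxxx"

# Print the control plane version of the cluster.
cluster_version() {
  gcloud container clusters describe "$CLUSTER" \
    --zone "$ZONE" --project "$PROJECT" \
    --format="value(currentMasterVersion)"
}

# Print the version the node pool is currently running.
pool_version() {
  gcloud container node-pools describe "$POOL" \
    --cluster "$CLUSTER" --zone "$ZONE" --project "$PROJECT" \
    --format="value(version)"
}

# Report whether the control plane and node pool versions differ.
compare_versions() {
  cv="$(cluster_version)"
  pv="$(pool_version)"
  if [ "$cv" = "$pv" ]; then
    echo "node pool matches control plane: $cv"
  else
    echo "version skew: control plane $cv, node pool $pv"
  fi
}
```

Calling `compare_versions` after sourcing the block prints a one-line summary.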

Willbin

Have you also tried changing the node image to a containerd image via the console?

1. Go to the Google Kubernetes Engine page in the Google Cloud console.
2. In the cluster list, click the name of the cluster you want to modify.
3. Click the Nodes tab.
4. In the Node pools section, click the name of the node pool that you want to modify.
5. On the Node pool details page, click Edit.
6. In the Nodes section, under Image type, click Change.
7. Select one of the containerd image types.
8. Click Change.
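Whichever route is used (console or command line), the resulting image type can be checked afterwards. The following is a rough sketch, assuming the redacted placeholder names from this thread; `pool_image_type` and `is_containerd` are hypothetical helper names.

```shell
# Placeholder values taken from the thread; substitute your own.
CLUSTER="xxxx-production"
POOL="default-pool"
ZONE="us-xxxxx"
PROJECT="project-id-xxxxx"

# Print the image type currently configured on the node pool.
pool_image_type() {
  gcloud container node-pools describe "$POOL" \
    --cluster "$CLUSTER" --zone "$ZONE" --project "$PROJECT" \
    --format="value(config.imageType)"
}

# Succeed only if the reported image type ends in _CONTAINERD
# (for example COS_CONTAINERD).
is_containerd() {
  case "$(pool_image_type)" in
    *_CONTAINERD|*_containerd) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_containerd && echo "migrated" || echo "still on the old image"`.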

waynep

I have not tried via the console, only the command line. I will schedule another maintenance window and try again!

Appreciate the advice.

Willbin

Great, thanks!

waynep

@Willbin I retried via the console and it failed again after 30 minutes. I then ran the same command-line upgrade and noticed that it failed with the following error; however, the node it was looking for doesn't exist in the pool. Any ideas?

finished with error: Google Compute Engine: Managed instance gke-xxxx-default-pool-4a9ae595-axzh not found.
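One way to confirm that the instance named in the error really is absent, and to look back at earlier upgrade attempts, is sketched below. The instance name is the one from the error message above; the other values are the thread's redacted placeholders, and the function names are made up for illustration.

```shell
# Placeholder values taken from the thread; substitute your own.
ZONE="us-xxxxx"
PROJECT="project-id-xxxxx"
INSTANCE="gke-xxxx-default-pool-4a9ae595-axzh"

# Succeed if a Compute Engine instance with this exact name exists.
instance_exists() {
  found="$(gcloud compute instances list --project "$PROJECT" \
    --filter="name=$INSTANCE" --format="value(name)")"
  [ -n "$found" ]
}

# List recent node upgrade operations so a failed one can be inspected
# further with `gcloud container operations describe OPERATION_ID`.
list_node_upgrades() {
  gcloud container operations list \
    --zone "$ZONE" --project "$PROJECT" \
    --filter="operationType=UPGRADE_NODES"
}
```

If `instance_exists` fails while the upgrade is complaining about that instance, something other than the upgrade probably removed it in the meantime.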

andre_guimarae1

Interesting issue. Note that the instance named in the error does not appear in your instance list.
What is your node management configuration?

Upgrade strategy
Surge upgrade
Max surge
Max unavailable
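These settings can also be read from the command line. A minimal sketch, assuming the redacted placeholder names from this thread and a hypothetical function name:

```shell
# Placeholder values taken from the thread; substitute your own.
CLUSTER="xxxx-production"
POOL="default-pool"
ZONE="us-xxxxx"
PROJECT="project-id-xxxxx"

# Print the autoscaling, node management, and upgrade settings
# (including max surge and max unavailable) of the node pool as YAML.
pool_upgrade_config() {
  gcloud container node-pools describe "$POOL" \
    --cluster "$CLUSTER" --zone "$ZONE" --project "$PROJECT" \
    --format="yaml(autoscaling,management,upgradeSettings)"
}
```

The `autoscaling` section of the output is also where one would see whether the cluster autoscaler is enabled on the pool.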

waynep

@andre_guimarae1 I'm assuming that turning off the autoscaler, as garisingh mentioned, would fix it, but to answer your question:

andre_guimarae1

Yes, I think so too!
Good luck, my friend, and let us know the results.

garisingh

@waynep - one thing you can try is to disable the cluster autoscaler on that pool before doing the upgrade / update.
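A sketch of what that sequence could look like with gcloud follows. It rests on several assumptions: the names are the thread's redacted placeholders, the minimum and maximum node counts are invented and must be replaced with the pool's real autoscaling limits, and the function names are hypothetical. A plausible explanation for the error, though nobody in this thread confirms it, is that the autoscaler removed an instance while the upgrade was still working through the pool.

```shell
# Placeholder values taken from the thread; substitute your own.
CLUSTER="xxxx-production"
POOL="default-pool"
ZONE="us-xxxxx"
PROJECT="project-id-xxxxx"
# Hypothetical autoscaling limits; use the pool's real values.
MIN_NODES=1
MAX_NODES=5

# Turn the cluster autoscaler off for this node pool.
disable_autoscaling() {
  gcloud container clusters update "$CLUSTER" \
    --no-enable-autoscaling --node-pool "$POOL" \
    --zone "$ZONE" --project "$PROJECT"
}

# Same migration command as in the original post.
migrate_to_containerd() {
  gcloud container clusters upgrade "$CLUSTER" \
    --image-type COS_CONTAINERD --node-pool "$POOL" \
    --zone "$ZONE" --project "$PROJECT"
}

# Turn the cluster autoscaler back on with the chosen limits.
enable_autoscaling() {
  gcloud container clusters update "$CLUSTER" \
    --enable-autoscaling --min-nodes "$MIN_NODES" --max-nodes "$MAX_NODES" \
    --node-pool "$POOL" --zone "$ZONE" --project "$PROJECT"
}

# Disable autoscaling, migrate, then restore autoscaling whether or not
# the migration succeeded; return the migration's exit status.
run_migration() {
  disable_autoscaling || return 1
  migrate_to_containerd
  status=$?
  enable_autoscaling
  return "$status"
}
```

Restoring autoscaling even after a failed migration is a design choice in this sketch; leaving it off until the failure is understood would be equally reasonable.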

waynep

ah... @garisingh that makes sense. will give it a shot.

waynep

To close the loop: disabling autoscaling resolved this issue. Thanks for the help, @garisingh!