
Failed to upgrade nodepool to containerd

Following the instructions and script here: https://cloud.google.com/kubernetes-engine/docs/how-to/migrate-containerd

 

gcloud container clusters upgrade 'xxxx-production' --project 'project-id-xxxxx' --zone 'us-xxxxx' --image-type 'COS_CONTAINERD' --node-pool 'default-pool'

 

resulted in the following error message:

 

All nodes in node pool [default-pool] of cluster [xxxxx-production] image will change from COS to COS_CONTAINERD. This operation is long-running and will block other operations on the 
cluster (including delete) until it has run to completion.

Do you want to continue (Y/n)?  Y

Upgrading xxxx-production... Updating default-pool, done with 0 out of 3 nodes (0.0%): 1 being processed...done.                                                                      
ERROR: (gcloud.container.clusters.upgrade) Operation [<Operation
 clusterConditions: [<StatusCondition
 canonicalCode: CanonicalCodeValueValuesEnum(NOT_FOUND, 5)
 message: 'Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.'>]
 detail: 'Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.'
 endTime: '2023-05-13T01:08:09.455974926Z'
 error: <Status
 code: 5
 details: []
 message: 'Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.'>
 name: 'operation-1683938273861-...........'
 nodepoolConditions: []
 operationType: OperationTypeValueValuesEnum(UPGRADE_NODES, 4)
 progress: <OperationProgress
 metrics: [<Metric
 intValue: 3
 name: 'NODES_TOTAL'>, <Metric
 intValue: 1
 name: 'NODES_FAILED'>, <Metric
 intValue: 0
 name: 'NODES_COMPLETE'>, <Metric
 intValue: 1
 name: 'NODES_DONE'>, <Metric
 intValue: 0
 name: 'NODE_PDB_DELAY_SECONDS'>]
 stages: []>
 selfLink: 'https://container.googleapis.com/v1/projects/..........'
 startTime: '2023-05-13T00:37:53.861775789Z'
 status: StatusValueValuesEnum(DONE, 3)
 statusMessage: 'Google Compute Engine: Managed instance gke-xxxx-default-pool-4a9ae595-tuog not found.'
 targetLink: 'https://container.googleapis.com/v1/projects/....'
 zone: 'us-xxx'>] finished with error: Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.

 

Any ideas on how to debug this? (It worked flawlessly on both of our staging clusters.)


Willbin
Former Googler

Hello @waynep,

Welcome to Google Cloud Community!

Can you confirm the version of the cluster you are currently using? Thanks

cluster 1.23.17-gke.1700

pool is on 1.23.16-gke.200

Perhaps the node pool should be upgraded to match the cluster first?
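If that's the route taken, here's a minimal sketch of upgrading just the node pool version first, reusing the placeholder names from the original command. Running the upgrade without --cluster-version or --image-type should bring the pool up to the control plane's version, but please double-check the behavior against your gcloud version:

gcloud container clusters upgrade 'xxxx-production' --project 'project-id-xxxxx' --zone 'us-xxxxx' --node-pool 'default-pool'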

Have you also tried changing the node image to a containerd image via the console? (There's also a quick command-line check sketched after the steps below.)


  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to verify.

  3. Click the Nodes tab.

  4. In the Node pools section, click the name of the node pool that you want to modify.

  5. On the Node pool details page, click Edit.

  6. In the Nodes section, under Image type, click Change.

  7. Select one of the containerd image types.

  8. Click Change.
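
For what it's worth, once the change goes through you can confirm the pool's image type from the command line; a minimal sketch using the placeholder names from this thread:

gcloud container node-pools describe 'default-pool' --cluster 'xxxx-production' --zone 'us-xxxxx' --format='value(config.imageType)'

This should print COS_CONTAINERD after the migration completes.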


 

I haven't tried via the console, only the command line. I'll schedule another maintenance window and try again!

appreciate the advice.

Great, thanks!

@Willbin I retried via the console and it failed again after 30 minutes. I then ran the same command-line upgrade and noticed it failed with the following; however, the node it was looking for doesn't exist in the pool. Any ideas?

 

finished with error: Google Compute Engine: Managed instance gke-xxxx-default-pool-4a9ae595-axzh not found.

(Screenshot attached: Screenshot 2023-07-08 at 10.09.54 AM.png)
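
One thing that might help narrow it down (my own assumption, not something from the migration doc): the "Managed instance ... not found" message refers to the Compute Engine managed instance group (MIG) behind the node pool, so you can compare the node name in the error against what the MIG actually contains. A rough sketch with the placeholder names used above; replace MIG_NAME with the name taken from the URL the first command prints:

gcloud container node-pools describe 'default-pool' --cluster 'xxxx-production' --zone 'us-xxxxx' --format='value(instanceGroupUrls)'

gcloud compute instance-groups managed list-instances MIG_NAME --zone 'us-xxxxx'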

Interesting issue; we can see that the instance named in the error doesn't appear in your instance list...
What is your node management configuration? (There's also a gcloud sketch for pulling these settings after the list.)

  • Upgrade strategy
  • Surge upgrade
  • Max surge
  • Max unavailable
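
A minimal sketch for pulling those settings from the command line, assuming the same placeholder names as above:

gcloud container node-pools describe 'default-pool' --cluster 'xxxx-production' --zone 'us-xxxxx' --format='yaml(autoscaling, management, upgradeSettings)'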

@andre_guimarae1 I'm assuming turning off the autoscaler, as @garisingh mentioned, will be the fix, but to answer your question:

(Screenshot attached: waynep_0-1689253679961.png)

 

Yes, I think so too!
Good luck my friend and let us know the results.

@waynep - one thing you can try is to disable the cluster autoscaler on that pool before doing the upgrade/update.
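
A minimal sketch of disabling it, using the placeholder names from this thread (double-check the flags against your gcloud version):

gcloud container clusters update 'xxxx-production' --project 'project-id-xxxxx' --zone 'us-xxxxx' --node-pool 'default-pool' --no-enable-autoscaling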

ah... @garisingh that makes sense. will give it a shot.

 

 

To close the loop: disabling autoscaling resolved this issue. Thanks for the help @garisingh!
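
For anyone landing here later: remember to re-enable autoscaling on the pool once the migration has finished. A sketch with the placeholder names from this thread (the min/max node counts here are made up; use your own):

gcloud container clusters update 'xxxx-production' --zone 'us-xxxxx' --node-pool 'default-pool' --enable-autoscaling --min-nodes 1 --max-nodes 5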
