Following the instructions and script here https://cloud.google.com/kubernetes-engine/docs/how-to/migrate-containerd
gcloud container clusters upgrade 'xxxx-production' --project 'project-id-xxxxx' --zone 'us-xxxxx' --image-type 'COS_CONTAINERD' --node-pool 'default-pool'
resulted in the following error message:
All nodes in node pool [default-pool] of cluster [xxxxx-production] image will change from COS to COS_CONTAINERD. This operation is long-running and will block other operations on the
cluster (including delete) until it has run to completion.
Do you want to continue (Y/n)? Y
Upgrading xxxx-production... Updating default-pool, done with 0 out of 3 nodes (0.0%): 1 being processed...done.
ERROR: (gcloud.container.clusters.upgrade) Operation [<Operation
clusterConditions: [<StatusCondition
canonicalCode: CanonicalCodeValueValuesEnum(NOT_FOUND, 5)
message: 'Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.'>]
detail: 'Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.'
endTime: '2023-05-13T01:08:09.455974926Z'
error: <Status
code: 5
details: []
message: 'Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.'>
name: 'operation-1683938273861-...........'
nodepoolConditions: []
operationType: OperationTypeValueValuesEnum(UPGRADE_NODES, 4)
progress: <OperationProgress
metrics: [<Metric
intValue: 3
name: 'NODES_TOTAL'>, <Metric
intValue: 1
name: 'NODES_FAILED'>, <Metric
intValue: 0
name: 'NODES_COMPLETE'>, <Metric
intValue: 1
name: 'NODES_DONE'>, <Metric
intValue: 0
name: 'NODE_PDB_DELAY_SECONDS'>]
stages: []>
selfLink: 'https://container.googleapis.com/v1/projects/..........'
startTime: '2023-05-13T00:37:53.861775789Z'
status: StatusValueValuesEnum(DONE, 3)
statusMessage: 'Google Compute Engine: Managed instance gke-xxxx-default-pool-4a9ae595-tuog not found.'
targetLink: 'https://container.googleapis.com/v1/projects/....'
zone: 'us-xxx'>] finished with error: Google Compute Engine: Managed instance gke-xxx-default-pool-4a9ae595-tuog not found.
Any ideas on how to debug this? (The same command worked flawlessly on both of our staging clusters.)
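One way to start debugging is to pull up the failed operation's record again after the fact. This is a minimal sketch, not an official procedure: the `ZONE` default is a hypothetical placeholder (the real zone is redacted above), the function names are made up for illustration, and the `operationType` filter key is an assumption based on the field name shown in the error output.

```shell
#!/usr/bin/env bash
# Hypothetical placeholder; replace with the zone of your own cluster.
ZONE="${ZONE:-us-central1-a}"

# Print the full record of one GKE operation, including its error detail.
describe_operation() {
  local operation_id="$1"
  gcloud container operations describe "$operation_id" --zone "$ZONE"
}

# List node-upgrade operations in the zone so the failed one can be located.
# Assumption: filtering on the operationType field seen in the error output.
list_upgrade_operations() {
  gcloud container operations list --zone "$ZONE" \
    --filter="operationType=UPGRADE_NODES"
}
```

After sourcing the file, `list_upgrade_operations` followed by `describe_operation OPERATION_ID` (with the ID taken from the `name:` field of the error) shows the same status without rerunning the upgrade.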
Hello @waynep,
Welcome to Google Cloud Community!
Can you confirm the version of the cluster you are currently using? Thanks
The cluster is on 1.23.17-gke.1700.
The node pool is on 1.23.16-gke.200.
Perhaps the node pool should be upgraded to match the cluster first?
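For anyone wanting to check the same thing, the two versions can be read from the command line. This is a rough sketch: `CLUSTER`, `NODE_POOL`, and `ZONE` are hypothetical placeholders and the function names are illustrative only. The last function relies on my understanding that omitting `--cluster-version` upgrades the pool to the control plane's version, so verify that against the gcloud reference before using it.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders; replace with your own values.
CLUSTER="${CLUSTER:-my-cluster}"
NODE_POOL="${NODE_POOL:-default-pool}"
ZONE="${ZONE:-us-central1-a}"

# Print the control plane (cluster) version.
control_plane_version() {
  gcloud container clusters describe "$CLUSTER" --zone "$ZONE" \
    --format="value(currentMasterVersion)"
}

# Print the node pool version.
node_pool_version() {
  gcloud container node-pools describe "$NODE_POOL" \
    --cluster "$CLUSTER" --zone "$ZONE" --format="value(version)"
}

# Upgrade only the node pool. Assumption: with no --cluster-version,
# the pool is brought to the control plane's current version.
upgrade_node_pool_version() {
  gcloud container clusters upgrade "$CLUSTER" \
    --node-pool "$NODE_POOL" --zone "$ZONE"
}
```

Comparing the output of `control_plane_version` and `node_pool_version` shows whether the pool lags behind the cluster.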
Have you also tried changing the node image to a containerd image via console?
Go to the Google Kubernetes Engine page in the Google Cloud console.
In the cluster list, click the name of the cluster you want to modify.
Click the Nodes tab.
In the Node pools section, click the name of the node pool that you want to modify.
On the Node pool details page, click Edit.
In the Nodes section, under Image type, click Change.
Select one of the containerd image types.
Click Change.
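Whichever route is used (console or command line), the resulting image type can be checked afterwards. A small sketch, assuming hypothetical `CLUSTER` and `ZONE` placeholders and an illustrative function name:

```shell
#!/usr/bin/env bash
# Hypothetical placeholders; replace with your own values.
CLUSTER="${CLUSTER:-my-cluster}"
ZONE="${ZONE:-us-central1-a}"

# Show each node pool with its version and current node image type.
list_node_pool_images() {
  gcloud container node-pools list --cluster "$CLUSTER" --zone "$ZONE" \
    --format="table(name,version,config.imageType)"
}
```

A pool that migrated successfully should report a containerd image type such as COS_CONTAINERD in the last column.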
I have not tried via the console, only the command line. I will schedule another maintenance window and try again!
I appreciate the advice.
Great, thanks!
@Willbin I retried via the console and it failed again after 30 minutes. I then ran the same command-line upgrade and noticed it failed with the following error; however, the node it was looking for doesn't exist in the pool. Any ideas?
finished with error: Google Compute Engine: Managed instance gke-xxxx-default-pool-4a9ae595-axzh not found.
Interesting issue. Note that the instance named in the error has no entry in your instance list...
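To confirm whether the instance from the error is really missing, the managed instance group behind the node pool can be listed directly. This is only a sketch: `CLUSTER`, `NODE_POOL`, and `ZONE` are hypothetical placeholders, and the function names are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders; replace with your own values.
CLUSTER="${CLUSTER:-my-cluster}"
NODE_POOL="${NODE_POOL:-default-pool}"
ZONE="${ZONE:-us-central1-a}"

# Print the URL(s) of the managed instance group(s) backing the node pool.
node_pool_instance_groups() {
  gcloud container node-pools describe "$NODE_POOL" \
    --cluster "$CLUSTER" --zone "$ZONE" --format="value(instanceGroupUrls)"
}

# List the instances currently in one managed instance group.
# The group name is the last path segment of the URL printed above.
list_group_instances() {
  local group_name="$1"
  gcloud compute instance-groups managed list-instances "$group_name" \
    --zone "$ZONE"
}
```

If the instance named in the error does not appear in that list, something removed or replaced it while the upgrade was still running.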
What is your node management config?
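The node management and autoscaling settings can also be read from the command line rather than the console. A minimal sketch, with hypothetical `CLUSTER`, `NODE_POOL`, and `ZONE` placeholders and an illustrative function name:

```shell
#!/usr/bin/env bash
# Hypothetical placeholders; replace with your own values.
CLUSTER="${CLUSTER:-my-cluster}"
NODE_POOL="${NODE_POOL:-default-pool}"
ZONE="${ZONE:-us-central1-a}"

# Print only the management (auto-upgrade, auto-repair) and
# autoscaling sections of the node pool description.
show_node_pool_management() {
  gcloud container node-pools describe "$NODE_POOL" \
    --cluster "$CLUSTER" --zone "$ZONE" \
    --format="yaml(management,autoscaling)"
}
```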
@andre_guimarae1 I'm assuming that turning off the autoscaler, as garisingh mentioned, would fix it, but to answer your question:
Yes, I think so too!
Good luck my friend and let us know the results.
@waynep - one thing you can try is to disable the cluster autoscaler on that pool before doing the upgrade / update.
To close the loop, disabling autoscaling resolved this issue. Thanks for the help, @garisingh!
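For reference, the sequence that worked can be sketched roughly as follows. `CLUSTER`, `NODE_POOL`, and `ZONE` are hypothetical placeholders, the function names are illustrative, and the min/max node counts passed when re-enabling autoscaling are assumptions to replace with the pool's previous limits.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders; replace with your own values.
CLUSTER="${CLUSTER:-my-cluster}"
NODE_POOL="${NODE_POOL:-default-pool}"
ZONE="${ZONE:-us-central1-a}"

# Step 1: turn the cluster autoscaler off for the pool.
disable_autoscaling() {
  gcloud container clusters update "$CLUSTER" \
    --no-enable-autoscaling --node-pool "$NODE_POOL" --zone "$ZONE"
}

# Step 2: migrate the pool to the containerd image
# (same command as in the original question).
migrate_to_containerd() {
  gcloud container clusters upgrade "$CLUSTER" \
    --image-type COS_CONTAINERD --node-pool "$NODE_POOL" --zone "$ZONE"
}

# Step 3: turn autoscaling back on with the limits the pool had before.
enable_autoscaling() {
  local min_nodes="$1" max_nodes="$2"
  gcloud container clusters update "$CLUSTER" \
    --enable-autoscaling --min-nodes "$min_nodes" --max-nodes "$max_nodes" \
    --node-pool "$NODE_POOL" --zone "$ZONE"
}
```

Run in order, that would be `disable_autoscaling`, then `migrate_to_containerd`, then `enable_autoscaling MIN MAX` once the migration has finished.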