
All pods in my cluster restart randomly

The pods in my GKE cluster restart randomly. Sometimes it happens after 7 days, sometimes after only 1-2 days. The logs show that the whole node pool is drained and recreated, and the pods are then started on the new node pool. I want to prevent this because it is happening in my production environment; the outage is only about 15 seconds, but we cannot tolerate any downtime. Is there a way to find the root cause, and how can I prevent this in the future? I have set up an uptime check, which fails when this occurs; when I inspect the pods, they have all been recreated on the new node pool and they all share the same creation time.

5 REPLIES

My best guess at this point is that you likely have auto-upgrade enabled for your node pools (which is the default).
You can check to see if any upgrades have occurred recently by running `gcloud container operations list`.
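
For example (a rough sketch; the operationType filter and output fields are standard gcloud usage, but adjust them to your project and location):

# List recent node upgrade operations and when they ran.
gcloud container operations list \
    --filter="operationType=UPGRADE_NODES" \
    --format="table(name, operationType, status, targetLink, startTime, endTime)"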

Assuming this is the case, you'll likely want to configure Maintenance Windows and/or Maintenance Exclusions so that upgrades only happen when you can tolerate downtime. With a Maintenance Exclusion, you can select the "no minor or node upgrades" scope to block upgrades for up to 6 months at a time, until the GKE version reaches end of support.
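
As a rough sketch, such an exclusion can be added with something like the following (cluster name, location, dates, and exclusion name are placeholders; check the current gcloud reference for the exact scope values):

# Block minor and node upgrades for a fixed window.
gcloud container clusters update CLUSTER_NAME \
    --location=us-central1 \
    --add-maintenance-exclusion-name=block-node-upgrades \
    --add-maintenance-exclusion-start=2023-09-15T00:00:00Z \
    --add-maintenance-exclusion-end=2024-03-15T00:00:00Z \
    --add-maintenance-exclusion-scope=no_minor_or_node_upgrades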

On 2nd September my whole node pool was recreated at 2:00 PM IST, and then at 5:00 PM it was recreated again. Why did the upgrade happen twice? I just want to disable every automatic upgrade; I will upgrade it myself. Can I use the blue-green upgrade option instead of the surge option? Will it prevent downtime?

It's odd that it was upgraded twice in such a short period. Any chance someone made a change to the node pool configuration? There are a few properties that will trigger a node pool recreation if you change them.

In terms of disabling auto-upgrades, take a look at my post above.  I'd suggest using a Maintenance Exclusion with the "no node upgrades" scope.

Blue-green can help, depending on the number of replicas you have, whether or not you have a PDB (PodDisruptionBudget) set, etc.
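
As a rough sketch of both (pool, cluster, location, label selector, and durations are placeholders, assuming your workload's pods are labeled app=my-app):

# Switch the node pool from surge to blue-green upgrades.
gcloud container node-pools update POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=us-central1 \
    --enable-blue-green-upgrade \
    --node-pool-soak-duration=1800s

# Keep at least 2 replicas available during voluntary disruptions such as node drains.
kubectl create poddisruptionbudget my-app-pdb \
    --selector=app=my-app \
    --min-available=2

With blue-green, the new (green) nodes are created and soaked before the old (blue) nodes are drained, which, combined with enough replicas and a PDB, helps avoid a window where all pods are down at once.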

How can I check the GKE upgrade logs?

In Logs Explorer, you can run a query like 

resource.type="gke_nodepool"
(log_id("cloudaudit.googleapis.com/activity") OR log_id("cloudaudit.googleapis.com/data_access"))
protoPayload.methodName:("UpdateNodePool" OR "UpdateClusterInternal")
resource.labels.cluster_name="YOUR_CLUSTER_NAME"
resource.labels.nodepool_name="YOUR_NODEPOOL_NAME"

for nodepool upgrades and a query like 

resource.type="gke_cluster"
(log_id("cloudaudit.googleapis.com/activity") OR log_id("cloudaudit.googleapis.com/data_access"))
protoPayload.methodName:("UpdateCluster" OR "UpdateClusterInternal")
(protoPayload.metadata.operationType="UPGRADE_MASTER"
  OR protoPayload.response.operationType="UPGRADE_MASTER")
resource.labels.cluster_name="YOUR_CLUSTER_NAME"

for cluster upgrades.
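
If you prefer the CLI, roughly the same node pool query can be run with gcloud logging read (a sketch; the filter mirrors the one above, and the freshness/limit values are arbitrary):

gcloud logging read '
  resource.type="gke_nodepool"
  (log_id("cloudaudit.googleapis.com/activity") OR log_id("cloudaudit.googleapis.com/data_access"))
  protoPayload.methodName:("UpdateNodePool" OR "UpdateClusterInternal")
  resource.labels.cluster_name="YOUR_CLUSTER_NAME"
  resource.labels.nodepool_name="YOUR_NODEPOOL_NAME"
' --freshness=7d --limit=20 --format=json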
