Unscheduled Google Compute Engine Kubernetes VM restarts

ChristianGlied · 11-23-2021 04:44 AM

We currently run five Google Compute Engine VMs in a single zone to host a Kubernetes cluster.

On 13.11.2021, between 10:30 and 10:50 CEST, all of those machines were restarted. None of our team members initiated this.

We have one other VM, also hosted on Google Compute Engine, that is not part of the cluster. That VM did not restart, so the issue might be related to Kubernetes.

As far as I could see, the logs did not show any manually scheduled restart, and the Google Cloud Status Dashboard did not report any issue during that timeframe that would explain it.

The only thing I found was an error message during reboot:

Error updating SSH keys for root: mkdir /root/.ssh: read-only file system.

I am not sure whether this is linked to the restart issue or to a separate problem with wrong permissions being set.

My question is:

Are there any known actions on Google's side, or other reasons, that could have caused the machines to restart (e.g. moving the VMs onto another node)?

Best regards,

Christian Glied

cloudpaul

Hey Christian,

Host failures and restarts are, unfortunately, a common occurrence across a server fleet as large as ours! By default, we live migrate VMs when there is an underlying host issue. You might want to check that your instances are set to do this, and not to Terminate (and restart).
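As a rough sketch of that check, the snippet below inspects the `scheduling` block of an instance resource, in the shape returned by `gcloud compute instances list --format=json`. The function name `check_scheduling` and the sample instance are made up for illustration, and the exact set of fields present on your instances is an assumption, so treat this as a starting point rather than a definitive audit.

```python
def check_scheduling(instance: dict) -> list[str]:
    """Return warnings about one instance's host-maintenance settings.

    `instance` is assumed to be one parsed element of the JSON printed by
    `gcloud compute instances list --format=json`.
    """
    name = instance.get("name", "<unnamed>")
    scheduling = instance.get("scheduling", {})
    warnings = []

    # MIGRATE means the VM is live migrated during host maintenance;
    # TERMINATE means it is stopped instead.
    policy = scheduling.get("onHostMaintenance")
    if policy != "MIGRATE":
        warnings.append(f"{name}: onHostMaintenance is {policy!r}, not 'MIGRATE'")

    # Without automaticRestart, a terminated VM stays down.
    if not scheduling.get("automaticRestart", False):
        warnings.append(f"{name}: automaticRestart is not enabled")

    # Preemptible VMs can be stopped at any time and cannot live migrate.
    if scheduling.get("preemptible", False):
        warnings.append(f"{name}: instance is preemptible")

    return warnings


# Hypothetical sample data, not taken from the cluster in this thread.
sample = {
    "name": "k8s-node-1",
    "scheduling": {"onHostMaintenance": "TERMINATE", "automaticRestart": True},
}
for line in check_scheduling(sample):
    print(line)
```

To run it against real data, you could save the `gcloud` output to a file, load it with `json.load`, and call `check_scheduling` on each element of the resulting list.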

You also mention that you're running everything in a single zone - we would recommend spreading across multiple zones to help prevent this type of issue from recurring. Take a look at Designing Robust Systems for more info.
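To see how your nodes are currently distributed, a small sketch like the following counts instances per zone. It assumes each instance resource carries a `zone` field holding a URL that ends in the zone name, as in the JSON from `gcloud compute instances list --format=json`. The helper names, project name, and sample data are hypothetical.

```python
from collections import Counter


def zone_of(instance: dict) -> str:
    """Extract the short zone name from the instance's zone URL."""
    return instance["zone"].rsplit("/", 1)[-1]


def zone_spread(instances: list[dict]) -> Counter:
    """Count how many instances live in each zone."""
    return Counter(zone_of(inst) for inst in instances)


# Hypothetical sample: five nodes, all in one zone.
base = "https://www.googleapis.com/compute/v1/projects/example-project/zones/"
nodes = [{"name": f"k8s-node-{i}", "zone": base + "europe-west1-b"} for i in range(5)]

spread = zone_spread(nodes)
if len(spread) == 1:
    print(f"All {len(nodes)} instances are in a single zone: {next(iter(spread))}")
```

If the counter has only one key, a single zonal event can affect every node at once.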

Unfortunately, I don't have any insight into the exact cause of what happened here - but hopefully the above helps you mitigate it in the future.
