Hello,
I'm using Google Kubernetes Engine, where my cluster's node pool is backed by Compute Engine instances defined by Instance Templates. I manage these Instance Templates through Managed Instance Groups.
The issue: my Instance Template gets replaced by a kind of fallback Instance Template that is created automatically. Because the fallback Instance Template uses a smaller machine type, this causes an outage in our cluster, as some services can no longer be scheduled.
I found logs in the Logs Explorer with the same timestamp at which the new fallback Instance Template was created (screenshot attached). "logs-explorer.png" shows that the Service Account, for some reason, tries to delete an Instance Group that does not even exist, and the logs show an error for this. A few minutes later, an Instance Template appears to be created. Under Compute Engine -> Instance Templates, I can see that the fallback Instance Template was created on "Aug 13, 2023, 12:13:03 AM" and is currently in use. This means the Instance Template was created automatically and set as the default.
Do you think it's a permission issue on the Instance Templates? I can see that the fallback Instance Template (which shouldn't be used) is configured with the default Service Account, and that seems to work consistently. The other Instance Template (which should be used) is configured with a different Service Account (XXXXXXXXXXXX-compute@developer.gserviceaccount.com), and it seems that something is not working there. It works for a while, but after some weeks (during the cluster's maintenance window) a fallback Instance Template is automatically created and used as the default. Maybe some permissions are re-fetched during the maintenance window and something doesn't work as it should. If that's the right direction, which permissions should I grant the Service Account? If you think it's not a permission issue on the Service Account, what else could be the cause?
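In case it helps with comparing the two templates, here is a minimal Python sketch of how they could be listed side by side. It assumes the output of `gcloud compute instance-templates list --format=json` has been loaded into a list, and it relies on the `name`, `creationTimestamp`, `properties.machineType` and `properties.serviceAccounts[].email` fields of the Instance Template resource. The template names, machine types and timestamps in the sample data are made up for illustration and are not my real values.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an RFC 3339 timestamp such as 2023-08-13T00:13:03.000-07:00."""
    # Older Python versions do not accept a trailing "Z", so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_templates(templates):
    """Return (name, created, machine type, service accounts), newest first.

    Assumes every template carries a creationTimestamp with a UTC offset.
    """
    rows = []
    for template in templates:
        props = template.get("properties", {})
        emails = [sa.get("email", "") for sa in props.get("serviceAccounts", [])]
        rows.append(
            (
                template.get("name", ""),
                parse_timestamp(template["creationTimestamp"]),
                props.get("machineType", ""),
                emails,
            )
        )
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data; real input would come from the gcloud JSON output.
    sample = [
        {
            "name": "intended-template",
            "creationTimestamp": "2023-06-01T10:00:00.000-07:00",
            "properties": {
                "machineType": "e2-standard-4",
                "serviceAccounts": [
                    {"email": "XXXXXXXXXXXX-compute@developer.gserviceaccount.com"}
                ],
            },
        },
        {
            "name": "fallback-template",
            "creationTimestamp": "2023-08-13T00:13:03.000-07:00",
            "properties": {
                "machineType": "e2-small",
                "serviceAccounts": [{"email": "default"}],
            },
        },
    ]
    for name, created, machine_type, emails in summarize_templates(sample):
        print(name, created.isoformat(), machine_type, emails)
```

The newest template comes first, so the automatically created fallback template and its Service Account can be read off directly next to the intended one.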
I also tested changing the permissions of the Service Account (XXXXXXXXXXXX-compute@developer.gserviceaccount.com) with Policy Simulator, but I received errors when testing the changes ("policy-simulator.png"). This means that Policy Simulator could not determine whether the result of the access attempt would change under the proposed allow policy.
Thanks for reading; I really appreciate your effort.
Kind regards