I'm trying to investigate a very strange issue on our GKE clusters where metrics for all of our CronJobs mysteriously stopped working at around 22:05 on the 20th of April. As far as I have been able to determine so far:
An example of the CronJob deployment details screen for one job shows an abrupt end to the metrics, with nothing in the four days since, despite the job continuing to run on the same schedule:
It's the same for a completely unrelated job that runs in a different node pool in the same cluster:
Metrics for non-CronJob deployments on the same cluster still work as expected:
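To rule out this being purely a console display problem, something like the following sketch can pull the same container CPU series straight from the Cloud Monitoring API and show when each CronJob pod's series last received data. The project, cluster, and pod name prefix below are placeholders, and I'm only looking at one of the standard GKE container metrics:

```python
# Rough sketch only: project, cluster, and pod name prefix are placeholders.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 6 * 86400}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": "projects/my-project",
        "filter": (
            'metric.type = "kubernetes.io/container/cpu/core_usage_time" '
            'AND resource.labels.cluster_name = "my-cluster" '
            'AND resource.labels.pod_name = starts_with("my-cronjob-")'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Points are returned newest-first, so the first point shows when data
# last arrived for each matching series.
for series in results:
    last_sample = series.points[0].interval.end_time
    print(series.resource.labels["pod_name"], "last sample:", last_sample)
```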
Alerting policies based on log counts for the CronJobs also stop at the same time, despite logs still being written by the containers:
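To double-check the "logs are still being written" part, a sketch along these lines queries the Cloud Logging API with roughly the same shape of filter the alerting policies use. The project, cluster, namespace, pod prefix, and timestamp are all placeholders:

```python
# Rough sketch: confirm the CronJob containers are still writing logs after the
# cutoff. Names and the timestamp below are placeholders.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")

log_filter = (
    'resource.type="k8s_container" '
    'AND resource.labels.cluster_name="my-cluster" '
    'AND resource.labels.namespace_name="my-namespace" '
    'AND resource.labels.pod_name:"my-cronjob-" '
    'AND timestamp>="2022-04-20T22:05:00Z"'  # placeholder for the ~22:05 cutoff
)

entries = client.list_entries(
    filter_=log_filter, order_by=cloud_logging.DESCENDING, max_results=5
)
for entry in entries:
    print(entry.timestamp, entry.payload)
```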
As I mentioned, the issue is affecting all CronJobs across multiple clusters, with the only commonality being:
It's possible that the issue is being caused by the difference in version between the control plane and the nodes; however, I can't see anything in the logs to suggest the control plane was updated at that time. Any other suggestions welcome.
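For reference, this is a minimal sketch of how the skew can be checked directly, comparing the control-plane version against each node's kubelet version (assuming cluster credentials are already configured locally, e.g. via gcloud container clusters get-credentials):

```python
# Minimal sketch, assuming kubectl credentials for the cluster are already set up.
from kubernetes import client, config

config.load_kube_config()

control_plane = client.VersionApi().get_code()
print("control plane:", control_plane.git_version)

# Each node reports its kubelet version; compare these against the control plane
# to see how large the skew actually is.
for node in client.CoreV1Api().list_node().items:
    kubelet = node.status.node_info.kubelet_version
    print(f"{node.metadata.name}: kubelet {kubelet}")
```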
Hello george-blis,
Welcome to Google Cloud Community!
It is guaranteed that control planes are compatible with nodes up to two minor versions older than the control plane. For example, GKE 1.23 control planes are compatible with GKE 1.21 nodes.
See the Kubernetes version and version skew support policy documentation.
To further inspect your project, it would be best to get in touch with Cloud Platform Support.
https://cloud.google.com/contact
Hi Willbin,
Thanks for the welcome and for confirming the Kubernetes version skew policy. Unfortunately, my organisation only has basic support, so I can only raise tickets relating to billing; if I try to raise anything else, I get directed here to the community support channel.