We are using Google Kubernetes Engine to schedule critical jobs. We have seen image pull failures happen silently, because no errors are logged for these events; there are only log entries with severity DEFAULT and WARNING.
Is there a way to configure either GCP or the cluster itself so that the appropriate severity is set on these log entries? I want to avoid per-workload manual work or having to look at log entries with a severity below ERROR: that is neither scalable nor robust for new workloads and different errors in the future (and it also breaks a core principle of our alerting setup, which assumes that errors are logged with the appropriate severity).
To be specific, I see different log entries:
In the "kubelet" logs:
> E0117 10:15:25.576033 1909 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cron-refresh-cache\" with ImagePullBackOff: \"Back-off pulling image...
In the "events" log (in the "default" namespace):
> Error: ImagePullBackOff
> Error: ErrImagePull
> Failed to pull image <path>: rpc error: code = NotFound desc = failed to pull and unpack image...
The log entries in the "events" log seem the most useful ones to surface as errors (that log is less noisy).
1. You can increase the kubelet's verbosity setting, as long as you know how to customize it. Be aware, however, that this is likely to increase your logging volume and cause a spike in log entries.
spec:
  containers:
  - name: kubelet
    command:
    - kubelet
    args:
    - --v=2  # Adjust verbosity level
2. I would rather set up alerting for this specific type of failure by filtering the logs in Cloud Logging, with a query like the one below (see also the log-based metric sketch after this list):
resource.type="k8s_cluster"
textPayload:"ImagePullBackOff" OR textPayload:"ErrImagePull"
severity>=ERROR
3. Use a tool like eventrouter to sink the events to a streaming system (Pub/Sub, Kafka, etc.) and take action on them from there; a GCP-native alternative using a Cloud Logging sink is sketched below as well.
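For option 2, one way to wire such a filter into alerting is to back it with a log-based metric and let an alerting policy fire on that metric. A minimal sketch with gcloud (the metric name is just a placeholder, and the severity clause is deliberately dropped here because these entries arrive at WARNING or DEFAULT rather than ERROR):
# Hypothetical metric name; adjust the filter to the exact logs you care about.
gcloud logging metrics create gke-image-pull-failures \
  --description="GKE image pull failures (ImagePullBackOff / ErrImagePull)" \
  --log-filter='resource.type="k8s_cluster" AND (textPayload:"ImagePullBackOff" OR textPayload:"ErrImagePull")'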
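For option 3, if running eventrouter feels like too much extra machinery, a Cloud Logging sink is a GCP-native alternative that pushes the matching entries to Pub/Sub so you can react to them downstream. A sketch, with PROJECT_ID and the topic name as placeholders:
# Route matching log entries to a Pub/Sub topic for downstream handling.
gcloud logging sinks create image-pull-failures-sink \
  pubsub.googleapis.com/projects/PROJECT_ID/topics/k8s-image-pull-failures \
  --log-filter='resource.type="k8s_cluster" AND (textPayload:"ImagePullBackOff" OR textPayload:"ErrImagePull")'
Note that after creating the sink you still have to grant its writer identity permission to publish to the topic.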
Hi, @rubenvann.
This is typical Kubernetes logging behavior: events such as "ImagePullBackOff" and "ErrImagePull" are recorded with low severity levels like "WARNING" or "DEFAULT". To meet your requirements, you can adjust the logging configuration or set up custom alerting rules to capture these logs with a higher severity, such as "ERROR".
Could you please let me know which monitoring and alerting tools you are using, and how the cluster was created? Was it provisioned using Autopilot or Standard mode?
If you're using Google Cloud Logging, you can aggregate logs from GKE and configure log-based metrics or alerting policies to capture specific log patterns, such as "ImagePullBackOff" or "ErrImagePull", and have your alerting policies treat any match as an error.
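To illustrate, an alerting policy that fires whenever such a log-based metric counts anything could look roughly like this (a sketch only: the metric name is a placeholder and the fields follow the Cloud Monitoring AlertPolicy API, so please verify against the current documentation before applying it, e.g. with gcloud alpha monitoring policies create --policy-from-file=policy.yaml):
# policy.yaml - fires when the (hypothetical) log-based metric records any entry
displayName: "GKE image pull failures"
combiner: OR
conditions:
- displayName: "ImagePullBackOff / ErrImagePull observed"
  conditionThreshold:
    filter: 'metric.type="logging.googleapis.com/user/gke-image-pull-failures" AND resource.type="k8s_cluster"'
    comparison: COMPARISON_GT
    thresholdValue: 0
    duration: 0s
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_DELTA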
Regards,
Mokit
> To meet our requirements, we can adjust the logging configurations
Can you elaborate on what configuration I need to change and how I need to change it in order to log errors for critical Kubernetes failures?
Thank you both for your replies!
@jayeshmahajan thank you, I will consider your suggestions. I would have to find a way to test the first option safely. For the second one, I think the filtering should not include severity >= ERROR; that is exactly the point of this topic. Ideally these messages would be logged as errors, but they aren't. If it turns out to be the most practical approach, though, I will just change the log filtering query that our custom alerting system uses. I don't consider the third option viable for us because it adds a lot of complexity to our setup (external tools in our stack) and doesn't sound like it would give much benefit.
@mokit We have our cluster set up using Autopilot. We use a custom alerting system that creates an alert for every log entry with severity >= ERROR; we do this specifically to avoid having to set up manual alerting for every different error condition.
This of course assumes that every piece of managed infrastructure also logs an error when it breaks or fails to run (which is currently not the case for our Kubernetes setup). If that assumption doesn't hold, it seems to me it is impossible to properly monitor the cluster, as arbitrary workloads can simply fail to run without actionable feedback in the form of an error.
I understand that this is a hard problem in general, because infrastructure tooling might use other logging formats "under the hood", but since this is Google-managed infrastructure I think it's reasonable to expect an error in the logs when a Kubernetes workload fails to run. The condition itself is clearly detected, because the Cloud Console does show a red icon indicating failure.
After a brief investigation, it looks like kubelet events are never errors; the highest severity they support is "Warning". That value is then faithfully used as the severity of the log entry as well.
I'd argue that a warning is not sufficient when a job does not run; it should be an error. I don't understand the reasoning behind this. How can I detect errors? We run a bunch of different resources on GCP, and I don't understand why Kubernetes doesn't log errors on failure.
So the whole chain of events seems a bit odd: Kubernetes fails to launch a pod and logs a *warning* for it containing the message "Error: ErrImagePull"; GCP then parses this text to display a red exclamation mark icon in the status of the managed pod; but not a single error shows up in the logs.
We have a similar problem: logs from the core Kubernetes components are not parsed properly, and it seems there is no way to configure this. The only workaround is to switch off the agents and collect the logs ourselves, but that can be tricky for system-level node logs.
Having log entries like this at DEFAULT severity is not useful:
E0601 21:47:37.642104 2219 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"xxx\" with CreateContainerConfigError: \"secret \\\"xxx-auth\\\" not found\"" pod="xxx-staging/xxx-59fbb59c44-29r59" podUID="ded6d0f8-ee9c-44ef-xxxx-6ead3f7b04a7"
For now we have a log-based alert as a workaround.
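In case it helps anyone else, a filter along these lines can drive such a log-based alert for kubelet entries like the one above (just a sketch; whether the message ends up in textPayload or jsonPayload depends on the node image and logging agent, which is why the bare quoted string is used to search all fields):
resource.type="k8s_node"
log_id("kubelet")
"Error syncing pod"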