
GKE: Workload logging broken since 1.30.9 auto-update?

We haven't received any application logs in BigQuery since the weekend's auto-update.

kube-system logs are still getting there, so routing is OK - it looks like "system" logging is working but "workload" logging isn't?

Any ideas? Downgrade?

1 ACCEPTED SOLUTION

So checking the logs helped.

The version update was coincidental - although it might have introduced changes to some of the log messages, which is what triggered the real problem: any variation in the log messages causes new columns to be created. (One of the ingresses logs HTTP headers, and that includes anything the client might put in them.) Over time these build up and exceed the 10,000-column limit.

The lack of control over the schema is a big problem with BigQuery as a logging sink. It already floods us with noisy emails about incompatible schemas when two components log incompatible things. That in turn makes it harder to see the real problems, as you get into the habit of ignoring the mails from logging.

So the actual fix was just to delete all of the historical logs and start again.

In the medium term we'll probably move off BigQuery, as the schema handling is all a bit of a kludge, and use something that we can actually configure.
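
For anyone hitting the same thing, a rough way to check and clean up looks like the following (project, dataset and table names are placeholders; this assumes the standard Cloud Logging-to-BigQuery export layout):

  # Rough field count for one exported table - nested fields count towards the 10,000-column limit.
  bq show --schema --format=prettyjson my-project:gke_logs.stdout_20250101 | grep -c '"name"'

  # If the count is near 10,000, drop the bloated table; the sink recreates it with a fresh schema on the next export.
  bq rm -t my-project:gke_logs.stdout_20250101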


3 REPLIES

Hi @Tim_Murnaghan,

Welcome to Google Cloud Community!

The GKE release notes page provides production updates for Google Kubernetes Engine (GKE), including announcements of new and updated features, bug fixes, known issues, and deprecated functionality. It covers all release channels and versions, so you can stay informed about the latest changes.

To address your workload logging issues, consider the following steps:

  1. Workload Logging Configuration - Verify that workload logging is still enabled on the cluster. Additionally, review the logging filters in Cloud Logging (Logs Explorer) to determine whether workload logs are reaching Cloud Logging but not being forwarded to BigQuery.

- Pod Logs: Check the pod logs in your application namespaces for any errors related to logging.

- Logging Driver: Ensure the logging driver specified in your pods' containers is correctly configured. 

  2. Google Cloud Logging Router Sinks - Check the configuration of your Google Cloud Logging router sink to confirm it still includes workload logs for forwarding to BigQuery. Ensure the sink's filter is correctly set to capture and route workload logs as intended (see the command sketch after this list).
  3. Network Connectivity - Verify that pods have network connectivity to BigQuery by checking firewall rules and network policies to confirm that logs can be transmitted successfully.
  4. Inspect the Logging Agents (fluentd/fluent-bit) for Workload-Specific Issues - If you're using Cloud Logging through Fluent Bit or a logging sidecar, verify that it's running correctly on your nodes. If any logging-related pods, such as fluentbit-gke, are restarting or stuck in CrashLoopBackOff, review their logs for errors.
  5. Try Restarting Workloads - If Fluent Bit is only capturing logs from new workloads, restarting the affected pods might help.
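
As a minimal sketch of the checks in steps 1 and 2 (CLUSTER, LOCATION and SINK_NAME are placeholders for your own cluster and sink, assuming the standard Cloud Logging integration):

  # Is workload logging still enabled on the cluster? Expect SYSTEM_COMPONENTS and WORKLOADS.
  gcloud container clusters describe CLUSTER --location LOCATION \
      --format="value(loggingConfig.componentConfig.enableComponents)"

  # List the Logging sinks and inspect the one exporting to BigQuery;
  # its filter should still match resource.type="k8s_container" for workload logs.
  gcloud logging sinks list
  gcloud logging sinks describe SINK_NAME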

By reviewing workload-specific logging configurations and troubleshooting steps, you can identify why workload logs are not reaching BigQuery. If the issue persists, contact Google Cloud Support with detailed logs, configurations, and error messages for further assistance.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

The sudden halt of application logs flowing into BigQuery after a GKE auto-update strongly suggests a compatibility issue or a configuration change introduced by the update. Let's troubleshoot this systematically:

1. Confirm Logging Agent Status

  • Check DaemonSet/Deployment:
    • Determine how your logging agent is deployed (DaemonSet or Deployment). It's most likely a DaemonSet for cluster-wide log collection.
    • Use kubectl get daemonset -n <logging-agent-namespace> or kubectl get deployment -n <logging-agent-namespace> to check its status.
    • Ensure all pods are running and ready.
  • Inspect Logging Agent Logs:
    • Use kubectl logs -n <logging-agent-namespace> <logging-agent-pod> to examine the logging agent's logs.
    • Look for error messages, warnings, or anything indicating a connection problem with BigQuery.
    • Look for entries indicating that workload logs are being skipped (a GKE-specific sketch follows this list).
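
On a standard GKE cluster the managed agent is usually the fluentbit-gke DaemonSet in kube-system (names may differ if you run your own collector), so the checks above look roughly like:

  # DaemonSet health: desired and ready counts should match on every node.
  kubectl get daemonset fluentbit-gke -n kube-system

  # Pick one agent pod and scan its recent logs for errors or dropped/skipped workload logs.
  kubectl get pods -n kube-system -l k8s-app=fluentbit-gke
  kubectl logs -n kube-system <one-of-the-fluentbit-gke-pods> --tail=100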

2. Verify BigQuery Logging Configuration

  • Check Logging Agent Configuration:
    • Examine the configuration of your logging agent (e.g., Fluentd, Fluent Bit, Ops Agent).
    • Ensure that the BigQuery output plugin is correctly configured.
    • Verify that the BigQuery dataset and table names are accurate.
    • Verify that any configured filters are not excluding the workload logs.
  • IAM Permissions:
    • Double-check that the service account used by the logging agent has the necessary IAM permissions to write to BigQuery.
    • Specifically, it should have the BigQuery Data Editor role or equivalent (the sketch after this section shows one way to check this).
  • BigQuery Dataset and Table:
    • Confirm that the BigQuery dataset and table exist.
    • Check if there are any recent changes to the dataset or table schema.
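
A rough sketch of those permission and dataset checks, assuming the export is done by a Cloud Logging sink (SINK_NAME, project, dataset and table names are placeholders):

  # The sink writes as its own service account; find that identity here.
  gcloud logging sinks describe SINK_NAME --format="value(writerIdentity)"

  # The destination dataset's access list should grant that identity write (dataEditor) access.
  bq show --format=prettyjson my-project:gke_logs

  # Confirm the dataset and the exported tables still exist and look sane.
  bq ls my-project:gke_logs
  bq show --schema my-project:gke_logs.stdout_20250101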

3. GKE Update Impact

  • Kubernetes Version Changes:
    • GKE updates can introduce changes to the Kubernetes API, which might affect logging agent configurations.
    • Check the GKE release notes for any relevant changes related to logging or monitoring (the sketch after this section shows how to confirm exactly which versions you are now on).
  • Container Runtime Changes:
    • Updates might involve changes to the container runtime (e.g., containerd), which can affect log file locations or formats.
    • Verify that the logging agent is configured to read logs from the correct location.
  • Ops Agent changes:
    • If you are using the Google Cloud Ops Agent, there may have been a change to its configuration, or a bug introduced in the update. Check the Ops Agent release notes.
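
A small sketch (cluster name and location are placeholders) to pin down exactly which control-plane and node versions the auto-update landed on, so you know which release notes to read:

  # Current control-plane and node versions.
  gcloud container clusters describe CLUSTER --location LOCATION \
      --format="value(currentMasterVersion,currentNodeVersion)"

  # Recent automatic upgrade operations on the cluster and node pools.
  gcloud container operations list \
      --filter="operationType:UPGRADE_MASTER OR operationType:UPGRADE_NODES" \
      --sort-by=~startTime --limit=5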

4. Filtering Issues

  • Namespace Filtering:
    • It is possible that the logging agent has a filter that is now excluding your application's namespace.
    • Because kube-system logs are still flowing while your application logs are not, it is very likely that a filter or configuration difference is responsible (the query sketched after this section helps narrow down where they are being dropped).
  • Log Format Changes:
    • If your application's log format has changed, the logging agent might not be able to parse it correctly.
    • Check if your application has made any recent changes to its logging output.
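
One way to narrow this down (a sketch; NAMESPACE is a placeholder): check whether workload logs are reaching Cloud Logging at all. If they show up here but not in BigQuery, the problem is on the export/sink side rather than inside the cluster.

  gcloud logging read \
      'resource.type="k8s_container" AND resource.labels.namespace_name="NAMESPACE"' \
      --freshness=1h --limit=5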

5. Potential Solutions

  • Restart Logging Agent Pods:
    • Try restarting the logging agent pods to see if it resolves the issue.
    • kubectl rollout restart daemonset -n <logging-agent-namespace> <logging-agent-name>
  • Review Logging Agent Configuration:
    • Carefully review the logging agent's configuration to ensure that it is compatible with the updated GKE version.
    • Pay special attention to filters.
  • Check GKE Release Notes:
    • Look for known issues or workarounds in the GKE release notes.
  • Downgrade GKE (Last Resort):
    • If you suspect a bug in the GKE update, consider downgrading to a previous stable version.
    • However, this should be a last resort, as downgrading can introduce other issues.
  • Verify Ops Agent Configuration
    • If you are using the ops agent, verify the configuration for workload logging, and check the release notes for any breaking changes.

Troubleshooting Steps

  1. Check Logging Agent Logs: Start by examining the logging agent's logs for error messages.
  2. Verify IAM Permissions: Ensure that the service account has the necessary BigQuery permissions.
  3. Inspect Logging Agent Configuration: Look for any configuration changes that might have been introduced by the GKE update.
  4. Check for Namespace/Log Format Filtering: Verify that no filters are inadvertently excluding your application logs.

By following these steps, you should be able to identify the cause of the problem and restore your application logs to BigQuery.

