For the past week we've been seeing a massive increase in fluentbit log parsing errors in the Logs Explorer. They are being reported at the node level, on the order of several thousand entries per hour.
jsonPayload: {
  message: "cannot parse '0503 23:59:59.678415' after %L"
  plugin: "parser"
}
logName: "projects/***************/logs/fluentbit"
receiveTimestamp: "2023-05-04T00:00:04.123812634Z"
resource: {
  labels: {
    cluster_name: "cluster"
    location: "europe-north1-a"
    node_name: "************************"
    project_id: "************************"
  }
  type: "k8s_node"
}
This is on Kubernetes 1.25.8-gke.500 with e2-standard-4 nodes and ConfigMap version fluentbit-gke-config-v1.2.0 (GKE default Fluent Bit v1.8.12 installation)
Hello @carstenthiel-te,
Welcome to Google Cloud Community!
You can configure Fluent Bit to filter which information gets logged.
- Open the kubernetes/fluentbit-configmap.yaml file in an editor.
- Uncomment the lines after ### sample log scrubbing filters and before ### end sample log scrubbing filters.
- Change the name of the ConfigMap from fluent-bit-config to fluent-bit-config-filtered by editing the metadata.name field.
- Save and close the file.
To make the update, change the DaemonSet to use a different ConfigMap that contains your chosen filters. Use the Kubernetes rolling update feature and preserve the old version of the ConfigMap.
See Update the Fluent Bit daemonset to use the new configuration
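As a rough illustration, the DaemonSet change boils down to pointing its config volume at the renamed ConfigMap. This is only a sketch: the volume name and the surrounding spec layout are assumptions and may differ in your cluster; only the ConfigMap name comes from the steps above.
# Sketch only: point the Fluent Bit config volume at the renamed ConfigMap,
# then let the rolling update replace the pods. The volume name
# "fluentbit-config" and the nesting shown here are assumptions.
spec:
  template:
    spec:
      volumes:
        - name: fluentbit-config
          configMap:
            name: fluent-bit-config-filtered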
Hi,
Thanks for the answer. Will this be reverted to the Google default with the next update?
To clarify, we have not made any changes ourselves and do want to remain with the upstream version. We are simply confused as to why this problem suddenly appeared with the latest update.
Carsten
Hi there,
Same problem here. We see one error followed by two warnings.
jsonPayload: {
  message: "cannot parse '0430 14:00:29.235176' after %L"
  plugin: "parser"
}
logName: "projects/************/logs/fluentbit"
severity: "ERROR"

jsonPayload: {
  message: "invalid time format %m%d %H:%M:%S.%L%z for '0430 14:00:29.234656'"
  plugin: "parser:glog"
}
logName: "projects/************/logs/fluentbit"
severity: "WARNING"

jsonPayload: {
  message: "Failed to get record: decoder: failed to decode payload: EOF"
  pid: "1"
}
labels: {
  compute.googleapis.com/resource_name: "gke-********-pool-******"
  k8s-pod/component: "fluentbit-gke"
  k8s-pod/controller-revision-hash: "****"
  k8s-pod/k8s-app: "fluentbit-gke"
  k8s-pod/kubernetes_io/cluster-service: "true"
  k8s-pod/pod-template-generation: "2"
}
logName: "projects/*******/logs/stderr"
severity: "WARNING"
Our cluster is composed of 3 e2-standard-4 nodes on 1.25.7-gke.1000, and the fluent-bit ConfigMap version is 1.2.0.
Currently, this problem is generating about 16,000 log entries per hour on top of our usual cluster logs.
Just like Carsten, we have not made any changes ourselves. Could this be resolved in the next update?
We have one more cluster with v1.24.11-gke.1000 and fluent-bit ConfigMap version 1.1.5. We have compared the glog parser that is failing, and both versions have the same Time_Format value. This means (or at least we can conclude) that what has changed is the component emitting the date, i.e. the datetime format being logged.
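To illustrate, this is roughly how that shared glog parser sits in the fluentbit-gke ConfigMap. Only the parser name and the Time_Format value are taken from the warnings above; the data key, the other fields, and the omitted Regex are assumptions:
data:
  parsers.conf: |
    [PARSER]
        # Sketch only: Name and Time_Format are the values reported in the
        # warnings above; Format, Time_Key and the omitted Regex are assumptions.
        Name        glog
        Format      regex
        Time_Key    time
        Time_Format %m%d %H:%M:%S.%L%z
The timestamps in the errors (e.g. '0430 14:00:29.234656') end right after the fractional seconds, so there is nothing left for the trailing %z to match, which would fit the conclusion that the component emitting the timestamp changed rather than the parser.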
I'm facing the same issue:
Hi there,
We have been working a little bit more on this.
Even though we were able to create a copy of the fluent-bit ConfigMap, we couldn't update the DaemonSet. The DaemonSet and the original ConfigMap carry a label that delegates control over them to the Kubernetes Addon Manager, and the manager "blocks" any update attempt: we updated the spec many times, only to find the DaemonSet had the original ConfigMap configured again.
This is the label:
labels:
addonmanager.kubernetes.io/mode: Reconcile
What we were trying to change in the ConfigMap was glog's Time_Format, from:
@alnino This is the customized ConfigMap I'm using right now, and I'm still getting the same error:
kind: ConfigMap
apiVersion: v1
metadata:
  name: es-fluentd-configmap
  namespace: logging
data:
  system.conf: |-
    <system>
      root_dir /tmp/fluentd-buffers/
    </system>
  containers.input.conf: |-
    <source>
      @id fluentd-containers.log
      @type tail # Fluentd built-in
      path /var/log/containers/*.log # location of container log
      pos_file /var/log/es-containers.log.pos
      tag raw.kubernetes.* # setup log tag
      read_from_head true
      exclude_path "#{ENV['FLUENT_CONTAINER_TAIL_EXCLUDE_PATH'] || use_default}"
      <parse>
        @type multi_format # apply multi-format-parser
        <pattern>
          format json # transform log to json format
          time_key time
          time_format %Y-%m-%dT%H:%M:%S.%NZ
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<level>[^ ]*)\|(?<logtime>[^\]]*)\|(?<app_name>[^ ]*)\|(?<log_type>[^ ]*)\|(?<program>[^ ]*)\|(?<function>[^ ]*)\|(?<log>.*)$/ # parse log with custom log format
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/ # if the format does not fit the above rule, parse with the default one
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
      </parse>
    </source>
    <match raw.kubernetes.**>
      @id raw.kubernetes
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      stream stream
      multiline_flush_interval 10
      max_bytes 500000
      max_lines 1000
    </match>
    <filter **>
      @id filter_concat
      @type concat
      key message
      multiline_end_regexp /\n$/
      separator ""
    </filter>
    <filter kubernetes.**>
      @id filter_kubernetes_metadata
      @type kubernetes_metadata
    </filter>
    <filter kubernetes.**>
      @id filter_parser
      @type parser
      key_name log
      reserve_data true
      remove_key_name_field true
      <parse>
        @type multi_format
        <pattern>
          format json
        </pattern>
        <pattern>
          format none
        </pattern>
      </parse>
    </filter>
    <filter kubernetes.**>
      @type record_transformer
      remove_keys $.docker.container_id,$.kubernetes.container_image_id,$.kubernetes.pod_id,$.kubernetes.namespace_id,$.kubernetes.master_url,$.kubernetes.labels.pod-template-hash
    </filter>
  forward.input.conf: |-
    <source>
      @id forward
      @type forward
    </source>
  output.conf: |-
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
      port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
      scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
      ssl_verify "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERIFY'] || 'true'}"
      ssl_version "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERSION'] || 'TLSv1_2'}"
      user "#{ENV['FLUENT_ELASTICSEARCH_USER'] || use_default}"
      password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD'] || use_default}"
      index_name "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_INDEX_NAME'] || 'logstash'}"
      logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
      logstash_format true
      request_timeout 30s
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
Time formats:
time_format %Y-%m-%dT%H:%M:%S.%NZ
time_format %Y-%m-%dT%H:%M:%S.%N%:z
time_format %Y-%m-%dT%H:%M:%S.%N%:z
We've fixed this issue internally and will start rolling it out over the next couple of weeks.
If you have not opened a support case, I'd suggest doing so in order to better track the release for the fix.
We started seeing this issue on Friday (9 June) without having made any changes to the cluster ourselves. Initially (on Friday) we were on version 1.25.8; we tried upgrading to 1.26.4 today to see if that would fix the problem, but the problem remains. We are getting 100,000+ warnings and errors per hour because of this.
The problem can easily be reproduced: simply create a new cluster in the console, select Standard mode and static version 1.26.4-gke.1400, and leave everything else at its default. After the cluster has been created, the logs will start filling up with the "invalid time format" and "cannot parse" errors.
This error is happening only for kube-system namespace for me. Is there any alternative to disable logging from this namespace?
{
  insertId: "kxlksa023nds"
  jsonPayload: {
    message: "Failed to get record: decoder: failed to decode payload: EOF"
    pid: "1"
  }
  labels: {6}
  logName: "projects/project-name/logs/stderr"
  receiveTimestamp: "2023-06-13T10:45:57.760970307Z"
  resource: {
    labels: {
      cluster_name: "my-cluster"
      container_name: "fluentbit-gke"
      namespace_name: "kube-system"
      pod_name: "fluentbit-gke-abgkw"
      project_id: "my-project"
    }
    type: "k8s_container"
  }
  severity: "WARNING"
  sourceLocation: {
    file: "server.go"
    line: "103"
  }
  timestamp: "2023-06-13T10:45:52.730976027Z"
}
It might take quite some time before we are able to roll out a fix for this. In the meantime, the recommended action to take is to create an exclusion filter on the "_Default" Log Router Sink.
You can find the log routers here. In the list, you'll see a bucket/sink named "_Default".
LOG_ID("fluentbit")
(jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))
Additional exclusion filter to remove "Failed to get record: decoder: failed to decode payload: EOF" log:
(
  LOG_ID("stderr")
  (jsonPayload.message =~ "^Failed to get record: decoder: failed to decode payload: EOF$" AND resource.labels.container_name = "fluentbit-gke")
)
OR
(
  LOG_ID("fluentbit")
  (jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))
)
Also, if you have not yet filed a bug through Google Cloud support, please chime in on
Sadly, I was not lucky enough to notice it in time; it quickly built up to billions of log entries.
Is there any update on the fix for this being rolled out? I opened a support request but they gave me the same recommendation for creating an exclusion rule.
We just got hit by this as of a few days ago out of the blue, which is basically the opposite of a fix being rolled out 😏
Anyways, in case you're like me and loathe non-automated GCP settings, here's how to do the logging exclusion filter with Terraform:
resource "google_logging_project_exclusion" "fluentbit_spam_reduction" {
name = "fluentbit-spam-reduction"
description = "Fluentbit spam reduction workaround"
filter = <<-EOT
LOG_ID("fluentbit")
(
jsonPayload.message =~ "^cannot parse .* after %L$"
OR (
jsonPayload.message =~ "^invalid time format"
AND jsonPayload.plugin = "parser:glog"
)
)
EOT
}
(Be sure to configure the Terraform Google provider with the correct project and whatnot; not shown here)
Does anyone have a filter expression for the "Failed to get record: decoder" warning?
Sorry if this is a stupid question, but with a bug report open and this clearly being a system-level issue rather than a user error, shouldn't we get a refund for the amount billed due to this error?
Our monthly bill increased by roughly 33%, and we were lucky: the error spam only started on the 26th of June and we have only just noticed it.
I don't want to think about what it would have cost if the errors had started towards the beginning of the month.
If that's not the case, can someone clarify why?
I really hope so; it's just not fair to charge users for something that was not our fault. In my case I used Autopilot, the default option that was supposed to configure everything correctly out of the box.
Same here. I would like to know if this log usage is going to be excluded from billing.
STATUS UPDATE:
You can track the status of the fix here
The fix should have propagated to all clusters in the fleet except for those which have maintenance windows or exclusions set. Those will be updated during the next available maintenance window.
Why is the start time for the incident recorded as July 6 when people have been reporting it since May? My own logs show it started well before July 6th. This wasn't a trivial bug; it cost us several hundred thousand dollars.
@danek Did you open a support case for your issue?