For the past week we've been seeing a massive increase in fluentbit log parsing errors in the Logs Explorer. They are being reported at the node level, on the order of several thousand entries per hour.
jsonPayload: {
  message: "cannot parse '0503 23:59:59.678415' after %L"
  plugin: "parser"
}
logName: "projects/***************/logs/fluentbit"
receiveTimestamp: "2023-05-04T00:00:04.123812634Z"
resource: {
  labels: {
    cluster_name: "cluster"
    location: "europe-north1-a"
    node_name: "************************"
    project_id: "************************"
  }
  type: "k8s_node"
}
This is on Kubernetes 1.25.8-gke.500 with e2-standard-4 nodes and ConfigMap version fluentbit-gke-config-v1.2.0 (GKE default Fluent Bit v1.8.12 installation)
Hello @carstenthiel-te,
Welcome to Google Cloud Community!
You can configure Fluent Bit to filter which information gets logged.
- Open the kubernetes/fluentbit-configmap.yaml file in an editor.
- Uncomment the lines after ### sample log scrubbing filters and before ### end sample log scrubbing filters.
- Change the name of the ConfigMap from fluent-bit-config to fluent-bit-config-filtered by editing the metadata.name field.
- Save and close the file.
To make the update, change the DaemonSet to use a different ConfigMap that contains your chosen filters. Use the Kubernetes rolling update feature and preserve the old version of the ConfigMap.
See Update the Fluent Bit daemonset to use the new configuration
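As a rough illustration, the DaemonSet change boils down to pointing its config volume at the renamed ConfigMap. This is only a sketch: the volume name and the surrounding spec layout are assumptions and may differ in your cluster; only the ConfigMap name comes from the steps above.
# Sketch only: point the Fluent Bit config volume at the renamed ConfigMap,
# then let the rolling update replace the pods. The volume name
# "fluentbit-config" and the nesting shown here are assumptions.
spec:
  template:
    spec:
      volumes:
        - name: fluentbit-config
          configMap:
            name: fluent-bit-config-filtered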
Hi,
Thanks for the answer. Will this be reverted to the Google default with the next update?
To clarify, we have not made any changes ourselves and do want to remain with the upstream version. We are simply confused as to why this problem suddenly appeared with the latest update.
Carsten
Hi there,
Same problem here. We see one error followed by two warnings.
jsonPayload: {
  message: "cannot parse '0430 14:00:29.235176' after %L"
  plugin: "parser"
}
logName: "projects/************/logs/fluentbit"
severity: "ERROR"

jsonPayload: {
  message: "invalid time format %m%d %H:%M:%S.%L%z for '0430 14:00:29.234656'"
  plugin: "parser:glog"
}
logName: "projects/************/logs/fluentbit"
severity: "WARNING"

jsonPayload: {
  message: "Failed to get record: decoder: failed to decode payload: EOF"
  pid: "1"
}
labels: {
  compute.googleapis.com/resource_name: "gke-********-pool-******"
  k8s-pod/component: "fluentbit-gke"
  k8s-pod/controller-revision-hash: "****"
  k8s-pod/k8s-app: "fluentbit-gke"
  k8s-pod/kubernetes_io/cluster-service: "true"
  k8s-pod/pod-template-generation: "2"
}
logName: "projects/*******/logs/stderr"
severity: "WARNING"
Our cluster is composed of 3 e2-standard-4 nodes on 1.25.7-gke.1000, and the fluent-bit ConfigMap version is 1.2.0.
Currently, this problem is generating about 16,000 log entries per hour on top of our usual cluster logs.
Just like Carsten, we have not made any changes ourselves. Could this be resolved in the next update?
We have one more cluster with v1.24.11-gke.1000 and fluent-bit ConfigMap version 1.1.5. We have compared the glog parser that is failing, and both versions have the same Time_Format value. This means (or at least we can conclude) that what has changed is the component emitting the date, i.e. the datetime format being logged.
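To illustrate, this is roughly how that shared glog parser sits in the fluentbit-gke ConfigMap. Only the parser name and the Time_Format value are taken from the warnings above; the data key, the other fields, and the omitted Regex are assumptions:
data:
  parsers.conf: |
    [PARSER]
        # Sketch only: Name and Time_Format are the values reported in the
        # warnings above; Format, Time_Key and the omitted Regex are assumptions.
        Name        glog
        Format      regex
        Time_Key    time
        Time_Format %m%d %H:%M:%S.%L%z
The timestamps in the errors (e.g. '0430 14:00:29.234656') end right after the fractional seconds, so there is nothing left for the trailing %z to match, which would fit the conclusion that the component emitting the timestamp changed rather than the parser.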
I'm facing the same issue:
Hi there,
We have been working a little bit more on this.
Even though we were able to create a copy of the fluent-bit ConfigMap, we couldn't update the DaemonSet. The DaemonSet and the original ConfigMap carry a label that delegates control over them to the Kubernetes Addon Manager, and the manager "blocks" any update attempt: we updated the spec many times, only to find the DaemonSet had the original ConfigMap configured again.
This is the label:
labels:
addonmanager.kubernetes.io/mode: Reconcile
What we were trying to change in the ConfigMap was glog's Time_Format, from:
@alnino This is the customized ConfigMap I'm using right now, and I'm still getting the same error:
kind: ConfigMap
apiVersion: v1
metadata:
  name: es-fluentd-configmap
  namespace: logging
data:
  system.conf: |-
    <system>
      root_dir /tmp/fluentd-buffers/
    </system>
  containers.input.conf: |-
    <source>
      @id fluentd-containers.log
      @type tail # Fluentd built-in
      path /var/log/containers/*.log # location of container log
      pos_file /var/log/es-containers.log.pos
      tag raw.kubernetes.* # setup log tag
      read_from_head true
      exclude_path "#{ENV['FLUENT_CONTAINER_TAIL_EXCLUDE_PATH'] || use_default}"
      <parse>
        @type multi_format # apply multi-format-parser
        <pattern>
          format json # transform log to json format
          time_key time
          time_format %Y-%m-%dT%H:%M:%S.%NZ
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<level>[^ ]*)\|(?<logtime>[^\]]*)\|(?<app_name>[^ ]*)\|(?<log_type>[^ ]*)\|(?<program>[^ ]*)\|(?<function>[^ ]*)\|(?<log>.*)$/ # parse log with custom log format
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/ # if the format does not fit the above rule, parse with the default one
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
      </parse>
    </source>
    <match raw.kubernetes.**>
      @id raw.kubernetes
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      stream stream
      multiline_flush_interval 10
      max_bytes 500000
      max_lines 1000
    </match>
    <filter **>
      @id filter_concat
      @type concat
      key message
      multiline_end_regexp /\n$/
      separator ""
    </filter>
    <filter kubernetes.**>
      @id filter_kubernetes_metadata
      @type kubernetes_metadata
    </filter>
    <filter kubernetes.**>
      @id filter_parser
      @type parser
      key_name log
      reserve_data true
      remove_key_name_field true
      <parse>
        @type multi_format
        <pattern>
          format json
        </pattern>
        <pattern>
          format none
        </pattern>
      </parse>
    </filter>
    <filter kubernetes.**>
      @type record_transformer
      remove_keys $.docker.container_id,$.kubernetes.container_image_id,$.kubernetes.pod_id,$.kubernetes.namespace_id,$.kubernetes.master_url,$.kubernetes.labels.pod-template-hash
    </filter>
  forward.input.conf: |-
    <source>
      @id forward
      @type forward
    </source>
  output.conf: |-
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
      port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
      scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
      ssl_verify "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERIFY'] || 'true'}"
      ssl_version "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERSION'] || 'TLSv1_2'}"
      user "#{ENV['FLUENT_ELASTICSEARCH_USER'] || use_default}"
      password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD'] || use_default}"
      index_name "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_INDEX_NAME'] || 'logstash'}"
      logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
      logstash_format true
      request_timeout 30s
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
Time formats:
time_format %Y-%m-%dT%H:%M:%S.%NZ
time_format %Y-%m-%dT%H:%M:%S.%N%:z
time_format %Y-%m-%dT%H:%M:%S.%N%:z
We've fixed this issue internally and will start rolling it out over the next couple of weeks.
If you have not opened a support case, I'd suggest doing so in order to better track the release for the fix.
We started seeing this issue on Friday (9 June) without having made any changes to the cluster ourselves. Initially (on Friday) we were on version 1.25.8; we tried upgrading to 1.26.4 today to see if that would fix the problem, but the problem remains. We are getting 100,000+ warnings and errors per hour because of this.
The problem can easily be reproduced: simply create a new cluster in the console, select Standard mode and static version 1.26.4-gke.1400, and leave everything else at its default. After the cluster has been created, the logs will start filling up with the "invalid time format" and "cannot parse" errors.
This error is happening only for kube-system namespace for me. Is there any alternative to disable logging from this namespace?
{
  insertId: "kxlksa023nds"
  jsonPayload: {
    message: "Failed to get record: decoder: failed to decode payload: EOF"
    pid: "1"
  }
  labels: {6}
  logName: "projects/project-name/logs/stderr"
  receiveTimestamp: "2023-06-13T10:45:57.760970307Z"
  resource: {
    labels: {
      cluster_name: "my-cluster"
      container_name: "fluentbit-gke"
      namespace_name: "kube-system"
      pod_name: "fluentbit-gke-abgkw"
      project_id: "my-project"
    }
    type: "k8s_container"
  }
  severity: "WARNING"
  sourceLocation: {
    file: "server.go"
    line: "103"
  }
  timestamp: "2023-06-13T10:45:52.730976027Z"
}
It might take quite some time before we are able to roll out a fix for this. In the meantime, the recommended action to take is to create an exclusion filter on the "_Default" Log Router Sink.
You can find the log routers here. In the list, you'll see a bucket/sink named "_Default".
LOG_ID("fluentbit")
(jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))
Additional exclusion filter to remove "Failed to get record: decoder: failed to decode payload: EOF" log:
(
  LOG_ID("stderr")
  (jsonPayload.message =~ "^Failed to get record: decoder: failed to decode payload: EOF$" AND resource.labels.container_name = "fluentbit-gke")
)
OR
(
  LOG_ID("fluentbit")
  (jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))
)
Also, if you have not yet filed a bug through Google Cloud support, please chime in on
Sadly, I was not lucky enough to notice it in time; it quickly built up to billions of log entries.
Is there any update on the fix for this being rolled out? I opened a support request but they gave me the same recommendation for creating an exclusion rule.
We just got hit by this as of a few days ago out of the blue, which is basically the opposite of a fix being rolled out 😏
Anyways, in case you're like me and loathe non-automated GCP settings, here's how to do the logging exclusion filter with Terraform:
resource "google_logging_project_exclusion" "fluentbit_spam_reduction" {
name = "fluentbit-spam-reduction"
description = "Fluentbit spam reduction workaround"
filter = <<-EOT
LOG_ID("fluentbit")
(
jsonPayload.message =~ "^cannot parse .* after %L$"
OR (
jsonPayload.message =~ "^invalid time format"
AND jsonPayload.plugin = "parser:glog"
)
)
EOT
}
(Be sure to configure the Terraform Google provider with the correct project and whatnot; not shown here)
Does anyone have a filter expression for the "Failed to get record: decoder" warning?
Sorry if this is a stupid question, but with a bug report open and this clearly being a system-level issue rather than a user error, shouldn't we get a refund for the amount billed due to this error?
Our monthly bill increased by roughly 33%, and we were lucky: the error spam only started on the 26th of June and we have only just noticed it.
I don't want to think about what it would have cost if the errors had started towards the beginning of the month.
If that's not the case, can someone clarify why?
I really hope so; it's just not fair to charge users for something that was not our fault. In my case I used Autopilot, the default option that was supposed to configure everything correctly out of the box.
Same here. I would like to know if this log usage is going to be excluded from billing.
STATUS UPDATE:
You can track the status of the fix here
The fix should have propagated to all clusters in the fleet except for those which have maintenance windows or exclusions set. Those will be updated during the next available maintenance window.
Why is the start time for the incident recorded as July 6 when people have been reporting it since May? My own logs show it started well before July 6th. This wasn't a trivial bug; it cost us several hundred thousand dollars.
@danek Did you open a support case for your issue?