Logs Explorer not working for Anthos workloads

ivan-aracki · 09-01-2022 01:50 AM

Sometimes, for a reason we have not identified, not all application logs show up in Logs Explorer. We noticed a pattern: the missing logs always come from one specific EC2 worker node (we use Anthos on AWS).

We also noticed that if we restart the fluentbit DaemonSet, logs appear again. Sometimes all of the missing logs appear, but sometimes only some of them.

Is this a known issue with Google Anthos on AWS?

Regards,
Ivan

comaro

If this were an issue related to quotas, I wouldn't expect it to work after restarting fluentbit. Fluentbit is constantly trying to push data to the Cloud Logging endpoint, and if it exceeded the quota on the GCP side, it should retry later.

Fluentbit itself doesn't have any quotas, but it could miss logs if there are hundreds of log entries being produced per second.

Also, if the issue happens only on a single EC2 instance, I wouldn't expect a quota to be the cause, since a quota problem would affect all nodes.

I'd suggest looking into the fluentbit logs once it gets into this failure state. They might show you why the logs can't be exported.
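
One way to get an overview of the fluentbit logs is to count messages per component. The following is only a rough sketch: it assumes each message starts with a bracketed tag such as [net] or [engine], as in the lines quoted in this thread, and the helper name count_by_component is made up for illustration. Real fluentbit output may carry extra prefixes (timestamp, level) that would need to be stripped first.

```python
import re
from collections import Counter

# Matches a leading "[component]" tag, e.g. "[net]" or
# "[output:stackdriver:stackdriver.0]". Assumed format, see the note above.
TAG = re.compile(r"^\[([^\]]+)\]")


def count_by_component(lines):
    """Count log lines per top-level component (text before the first ':')."""
    counts = Counter()
    for line in lines:
        match = TAG.match(line.strip())
        if match:
            counts[match.group(1).split(":")[0]] += 1
    return counts


# Sample lines copied from the fluentbit output reported in this thread.
sample = [
    "[output:stackdriver:stackdriver.0] client_email is not defined, using a default one",
    "[net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers",
    "[engine] failed to flush chunk '1-1662017821.167899124.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=stackdriver.0 (out_id=0)",
]

if __name__ == "__main__":
    for component, count in count_by_component(sample).most_common():
        print(component, count)
```

Comparing these counts between the affected node and a healthy one may help show whether the DNS timeouts or the flush failures are specific to that node.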

bkauf

Hi Ivan, this is not a known issue. Would you be able to file a support ticket and include the fluentbit logs?

ivan-aracki

Hi @bkauf, where should we open the ticket regarding Anthos?

Also, do we need to attach only the WARN and ERR logs of fluentbit from around the time of the failure?

ivan-aracki

We are getting various warnings and errors, such as:

WARNINGS:

[output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[parser:appglog] invalid time format %m%d %H:%M:%S.%L%z for '0901 07:42:05.986349'
[engine] failed to flush chunk '1-1662017821.167899124.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=stackdriver.0 (out_id=0)

ERRORS:

[input:tail:tail.2] inode=777871 cannot register file /var/log/pods/gke-system_fluentbit-gke-75ptj_fe12bef2-180e-4f2e-86b3-f699f0e47470/fluentbit-gke/0.log (deleted)
[plugins/in_tail/tail_file.c:1311 errno=2] No such file or directory
[parser] cannot parse '0901 07:57:15.142765' after %L
[http_client] broken connection to logging.googleapis.com:443 ?

...and finally we get warnings like these:

[engine] service will shutdown in max 5 seconds
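
A note on the [parser:appglog] messages above: the configured format ends in %z (a UTC offset), while the sample timestamp '0901 07:42:05.986349' carries no offset, which would explain why parsing stops after %L. The sketch below reproduces the mismatch with Python's strptime as an analogy only. Python's %f stands in for Fluent Bit's %L, the parses helper and the year argument are made up for illustration, and Fluent Bit's own parser may differ in detail.

```python
from datetime import datetime

SAMPLE = "0901 07:42:05.986349"  # timestamp copied from the parser warning

# Python's %f (fractional seconds) is used here in place of Fluent Bit's %L.
WITH_ZONE = "%m%d %H:%M:%S.%f%z"     # mirrors the format in the warning
WITHOUT_ZONE = "%m%d %H:%M:%S.%f"    # same format with the offset removed


def parses(fmt, value=SAMPLE, year=2022):
    """Return True if value matches fmt.

    A year is prepended (an assumption for this sketch) so the date is
    complete; the original timestamp has no year.
    """
    try:
        datetime.strptime(f"{year} {value}", f"%Y {fmt}")
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    print("with %z:", parses(WITH_ZONE))
    print("without %z:", parses(WITHOUT_ZONE))
    print("with %z and an offset in the input:", parses(WITH_ZONE, SAMPLE + "+0000"))
```

If the same holds for fluentbit, these parser messages are probably a separate issue from the missing logs, and the DNS timeouts and failed chunk flushes look like the more relevant lines to include in the support ticket.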