
Influx of logs "Unable to sync sandbox" and "Pod 'xyz' Not Found After Refresh"

Hello,

TL;DR: Known issue with a large volume of misleading logs originating from gke-metadata-server. See https://issuetracker.google.com/issues/297084117
If you are running into this issue, please go to the issue tracker link and click the "vote impacted" button (it's the +1).

ethansq_0-1694177205086.png

See the replies below for the relevant details; much of this post was spent trying to pinpoint the cause and determine whether it was affecting anything (besides log billing).

Original Post

Update: I have narrowed the issue down to a new gke-metadata-server image being responsible; I have replied below with the image versions in question.

I need help identifying and mitigating a sudden influx of warning logs saying "Pod 'xyz' not found after refresh" that started on September 4th and is affecting both our Production and Test projects.

Environment Details:

  • GKE Version: 1.26.5-gke.2700
  • GKE Version: 1.26.6-gke.1700

On Sept 4th around 6:40am ET, a large number of warning logs saying Pod 'xyz' not found after refresh started appearing, e.g.:

{
  "insertId": "udi8yt01e2lzzokx",
  "jsonPayload": {
    "pid": "90",
    "message": "[background-id:05ec2c8f60778cd6] Pod \"kube-system/kube-proxy-gke-foobar-cluster-default-pool-38dabaaf-vhqd\" not found after refresh"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "namespace_name": "kube-system",
      "pod_name": "gke-metadata-server-bmcwb",
      "location": "europe-west3",
      "cluster_name": "foobar-cluster",
      "container_name": "gke-metadata-server",
      "project_id": "<project_id>"
    }
  },
  "timestamp": "2023-09-04T10:39:20.947448768Z",
  "severity": "WARNING",
  "labels": {
    "k8s-pod/pod-template-generation": "8",
    "compute.googleapis.com/resource_name": "gke-foobar-default-pool-38dabaaf-vhqd",
    "k8s-pod/addonmanager_kubernetes_io/mode": "Reconcile",
    "k8s-pod/controller-revision-hash": "675f54f544",
    "k8s-pod/k8s-app": "gke-metadata-server"
  },
  "logName": "projects/foobar/logs/stderr",
  "sourceLocation": {
    "file": "podfinder.go",
    "line": "199"
  },
  "receiveTimestamp": "2023-09-04T10:39:26.560291300Z"
}

These warnings seem to come from every gke-metadata-server pod, and the pod emitting them will even report that it can't find itself after refresh. gke-metadata-server runs in the kube-system namespace and is not something we manage.
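
For anyone who wants to confirm where these come from, you can look at the DaemonSet's pods directly. This is just a sketch; the label and pod name are taken from the log entry above, so substitute one of your own pod names:

# list the gke-metadata-server pods (one per node, managed by GKE)
kubectl get pods -n kube-system -l k8s-app=gke-metadata-server -o wide

# tail one of them and filter for the warning
kubectl logs -n kube-system gke-metadata-server-bmcwb --tail=200 | grep "not found after refresh"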

I've checked whether Pods are restarting, made sure certificates are fine, and gone through the most obvious things, but so far I haven't found anything on our end that explains this.

Interestingly, this affects both our production project and our test project, starting on the same day just a few hours apart. To our knowledge no changes were made on our end, and resource requests etc. do not look obviously different before versus after the large volume of logs began.

I have been investigating whether this is affecting performance and whether it could have been caused by us. I did stumble upon some logs that seem to correlate somewhat (likely coincidental):

"MESSAGE": "I0904 13:32:47.915838    1794 provider.go:82] Docker config file not found: couldn't find valid .dockercfg after checking in [/var/lib/kubelet   /]"

Full Log JSON:

{
  "insertId": "33qvz8fyympkj1id",
  "jsonPayload": {
    "SYSLOG_IDENTIFIER": "kubelet",
    "_SYSTEMD_SLICE": "system.slice",
    "_HOSTNAME": "gke-test-foobar-test-e2-medium-po-82023331-ritp",
    "_SYSTEMD_CGROUP": "/system.slice/kubelet.service",
    "_SYSTEMD_UNIT": "kubelet.service",
    "_TRANSPORT": "stdout",
    "_STREAM_ID": "1526a4d00de34127b7d7be4895642b54",
    "_CAP_EFFECTIVE": "1ffffffffff",
    "_COMM": "kubelet",
    "PRIORITY": "6",
    "_PID": "1794",
    "_GID": "0",
    "_CMDLINE": "/home/kubernetes/bin/kubelet --v=2 --cloud-provider=external --experimental-mounter-path=/home/kubernetes/containerized_mounter/mounter --cert-dir=/var/lib/kubelet/pki/ --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=110 --volume-plugin-dir=/home/kubernetes/flexvolume --node-status-max-images=25 --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --runtime-cgroups=/system.slice/containerd.service --registry-qps=10 --registry-burst=20 --config /home/kubernetes/kubelet-config.yaml \"--pod-sysctls=net.core.somaxconn=1024,net.ipv4.conf.all.accept_redirects=0,net.ipv4.conf.all.forwarding=1,net.ipv4.conf.all.route_localnet=1,net.ipv4.conf.default.forwarding=1,net.ipv4.ip_forward=1,net.ipv4.tcp_fin_timeout=60,net.ipv4.tcp_keepalive_intvl=60,net.ipv4.tcp_keepalive_probes=5,net.ipv4.tcp_keepalive_time=300,net.ipv4.tcp_rmem=4096 87380 6291456,net.ipv4.tcp_syn_retries=6,net.ipv4.tcp_tw_reuse=0,net.ipv4.tcp_wmem=4096 16384 4194304,net.ipv4.udp_rmem_min=4096,net.ipv4.udp_wmem_min=4096,net.ipv6.conf.all.disable_ipv6=1,net.ipv6.conf.default.accept_ra=0,net.ipv6.conf.default.disable_ipv6=1,net.netfilter.nf_conntrack_generic_timeout=600,net.netfilter.nf_conntrack_tcp_be_liberal=1,net.netfilter.nf_conntrack_tcp_timeout_close_wait=3600,net.netfilter.nf_conntrack_tcp_timeout_established=86400\" --pod-infra-container-image=gke.gcr.io/pause:3.8@sha256:880e63f94b145e46f1b1082bb71b85e21f16b99b180b9996407d61240ceb9830",
    "_EXE": "/home/kubernetes/bin/kubelet",
    "_SYSTEMD_INVOCATION_ID": "472e94daf67b4def917ffa4c62b90c02",
    "SYSLOG_FACILITY": "3",
    "_UID": "0",
    "_BOOT_ID": "869a53ee4fb54089944498b26f3cbc53",
    "_MACHINE_ID": "bbbb6336ba3c7f4eeb5e78a6b8326fd9",
    "MESSAGE": "I0904 13:32:47.915838    1794 provider.go:82] Docker config file not found: couldn't find valid .dockercfg after checking in [/var/lib/kubelet   /]"
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "node_name": "gke-test-foobar-test-e2-medium-po-82023331-ritp",
      "project_id": "test-sq-env",
      "location": "europe-west3",
      "cluster_name": "foobar"
    }
  },
  "timestamp": "2023-09-04T13:32:47.915881Z",
  "logName": "projects/<project_id>/logs/kubelet",
  "receiveTimestamp": "2023-09-04T13:32:48.578704243Z"
}

Again, this is not related, just coincidental; I'm keeping it here in case it's somehow useful.
This error does show up well before Sept 4th, but on the test project its timing is within seconds of when the warning messages began. What I found strange was that the URL it fails to request has a period right before the forward slash:

http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg

{
    "insertId": "8qzqdkm5bquvjroe",
    "jsonPayload": {
      "MESSAGE": "I0904 13:32:47.927263    1794 credentialutil.go:63] \"Failed to read URL\" statusCode=404 URL=\"http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg\"",
      "_EXE": "/home/kubernetes/bin/kubelet",
      "_COMM": "kubelet",
      "_SYSTEMD_CGROUP": "/system.slice/kubelet.service",
      "_SYSTEMD_SLICE": "system.slice",
      "_BOOT_ID": "869a53ee4fb54089944498b26f3cbc53",
      "_CAP_EFFECTIVE": "1ffffffffff",
      "_HOSTNAME": "gke-test-foo-cluster-test-e2-medium-po-82023331-ritp",
      "_SYSTEMD_INVOCATION_ID": "472e94daf67b4def917ffa4c62b90c02",
      "_MACHINE_ID": "bbbb6336ba3c7f4eeb5e78a6b8326fd9",
      "SYSLOG_FACILITY": "3",
      "_STREAM_ID": "1526a4d00de34127b7d7be4895642b54",
      "_SYSTEMD_UNIT": "kubelet.service",
      "_TRANSPORT": "stdout",
      "_GID": "0",
      "_CMDLINE": "/home/kubernetes/bin/kubelet --v=2 --cloud-provider=external --experimental-mounter-path=/home/kubernetes/containerized_mounter/mounter --cert-dir=/var/lib/kubelet/pki/ --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=110 --volume-plugin-dir=/home/kubernetes/flexvolume --node-status-max-images=25 --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --runtime-cgroups=/system.slice/containerd.service --registry-qps=10 --registry-burst=20 --config /home/kubernetes/kubelet-config.yaml \"--pod-sysctls=net.core.somaxconn=1024,net.ipv4.conf.all.accept_redirects=0,net.ipv4.conf.all.forwarding=1,net.ipv4.conf.all.route_localnet=1,net.ipv4.conf.default.forwarding=1,net.ipv4.ip_forward=1,net.ipv4.tcp_fin_timeout=60,net.ipv4.tcp_keepalive_intvl=60,net.ipv4.tcp_keepalive_probes=5,net.ipv4.tcp_keepalive_time=300,net.ipv4.tcp_rmem=4096 87380 6291456,net.ipv4.tcp_syn_retries=6,net.ipv4.tcp_tw_reuse=0,net.ipv4.tcp_wmem=4096 16384 4194304,net.ipv4.udp_rmem_min=4096,net.ipv4.udp_wmem_min=4096,net.ipv6.conf.all.disable_ipv6=1,net.ipv6.conf.default.accept_ra=0,net.ipv6.conf.default.disable_ipv6=1,net.netfilter.nf_conntrack_generic_timeout=600,net.netfilter.nf_conntrack_tcp_be_liberal=1,net.netfilter.nf_conntrack_tcp_timeout_close_wait=3600,net.netfilter.nf_conntrack_tcp_timeout_established=86400\" --pod-infra-container-image=gke.gcr.io/pause:3.8@sha256:880e63f94b145e46f1b1082bb71b85e21f16b99b180b9996407d61240ceb9830",
      "PRIORITY": "6",
      "_PID": "1794",
      "_UID": "0",
      "SYSLOG_IDENTIFIER": "kubelet"
    },
    "resource": {
      "type": "k8s_node",
      "labels": {
        "location": "europe-west3",
        "node_name": "gke-test-foo-cluster-test-e2-medium-po-82023331-ritp",
        "cluster_name": "test-foo-cluster",
        "project_id": "<project_id>"
      }
    },
    "timestamp": "2023-09-04T13:32:47.927311Z",
    "logName": "projects/<project_id>/logs/kubelet",
    "receiveTimestamp": "2023-09-04T13:32:48.578704243Z"
  }
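
For what it's worth, the trailing dot in metadata.google.internal. is just a fully qualified hostname, and a 404 here only means the google-dockercfg instance attribute isn't set, which is normal if you don't use it. A quick way to spot-check from a node (a sketch, assuming curl is available on the node or in a pod with metadata access):

# prints the HTTP status code; 404 simply means the attribute isn't set
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'Metadata-Flavor: Google' \
  'http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg'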

On our test project, one of the first nodes to start showing these errors had this log roughly 3.5 seconds before the Pod not found after refresh logs began en masse.

I'm not sure whether this is just coincidence or a side effect of the same issue (pods starting up or something?), but I built a log query to find all the warning logs saying Pod 'xyz' not found after refresh and all the logs containing the URL with the period in it. Interestingly (for me at the moment), the Pod 'xyz' not found after refresh messages (warning, yellow) almost always appear in close proximity to the `Failed to read URL` messages (notice, blue):

Test Project
ethansq_0-1694022730087.png

Here is another picture of the histogram without the most recent surge on the 4th of Sept:
ethansq_1-1694022850222.png

Production Project
ethansq_2-1694023599963.png

Query used for these pictures:

jsonPayload.message=~"[^ =\t\n\r\f\"\(\)\[\]\|']+\] Pod \"[^ =\t\n\r\f\"\(\)\[\]\|']+\" not found after refresh"
OR
jsonPayload.MESSAGE=~"http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg"

Here are the queries I've been using to narrow down logs related to this issue:

jsonPayload.message=~"[^ =\t\n\r\f\"\(\)\[\]\|']+\] Unable to sync sandbox \"([0-9a-fA-F]*([a-fA-F]+[0-9]+|[0-9]+[a-fA-F]+)[0-9a-fA-F]*)\": setup not complete: podUID=\"[^ =\t\n\r\f\"\(\)\[\]\|']+\", ip"
OR
jsonPayload.message=~"[^ =\t\n\r\f\"\(\)\[\]\|']+\] Pod \"[^ =\t\n\r\f\"\(\)\[\]\|']+\" not found after refresh"
OR
jsonPayload.message=~"[^ =\t\n\r\f\"\(\)\[\]\|']+\] Syncing pod \"[^ =\t\n\r\f\"\(\)\[\]\|']+"
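
Side note: if the main concern while this is open is log ingestion cost, one possible stopgap is to add an exclusion for these warnings on the _Default sink. This is only a sketch (the exclusion name is made up, and excluded entries are not stored or searchable, so weigh that against the noise):

gcloud logging sinks update _Default \
  --add-exclusion=name=gke-metadata-server-refresh-noise,filter='resource.labels.container_name="gke-metadata-server" AND jsonPayload.message=~"not found after refresh"'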

CPU usage for gke-metadata-server: you can see all the old pods drop off and the replacement pods using more CPU. No other workloads show this pattern.

ethansq_0-1694096589055.png
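
For a quick spot check of current usage (as opposed to the historical chart above), something like this should work, assuming the cluster's metrics pipeline is serving data to kubectl top:

# current CPU/memory for the metadata-server pods
kubectl top pods -n kube-system -l k8s-app=gke-metadata-server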

Thanks!

Search Keys:
Unable to sync sandbox info logs populating
Pod 'kube-system/xyz' Not Found After Refresh
gke-metadata-server spamming logs

1 ACCEPTED SOLUTION

On the `gke-metadata-server` logs, I believe this is tied to https://issuetracker.google.com/issues/297084117

The messages themselves are actually benign (unless the pods actually don't come back).  There's some work underway to reduce the severity of the messages from "Warning" to "Info" as well.


7 REPLIES

Image change for gke-metadata-server workload lines up with increased logs

previous image:

gke.gcr.io/gke-metadata-server:gke_metadata_server_20230308.00_p0@sha256:c6116d0b53bb58bfebb661a5aaaf18835d64b3db8a99b59a5556cee76277fb9e

ethansq_2-1694101953838.png
 
current image:

gke.gcr.io/gke-metadata-server:gke_metadata_server_20230808.00_p0@sha256:c64b779853dfa571d5583df91fe77dde18b6e0b465df58077f6422bd8edc99e3


ethansq_1-1694101903717.png

I also spotted this log, which doesn't appear to exist prior to this image:

jsonPayload.message="Got error when getting machine owner: failed to get proto for GMI path: open /etc/googlemachineidentity/live/machine_identity.pb: no such file or directory"

ethansq_0-1694125455583.png

I'm seeing a similar issue with the same error message. It started happening yesterday, 9/6. I'm on the same GKE version, and gke-metadata-server is on gke.gcr.io/gke-metadata-server:gke_metadata_server_20230808.00_p0 as well.

image (7).png
Same deal: the gke-metadata-server image upgrade from 20230308 to 20230808 coincides with the timing of the issue.
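
If you want to check which image your own clusters are running, something like this should work (assuming the DaemonSet is named gke-metadata-server, which matches the pod names in the logs above):

# image currently used by the gke-metadata-server DaemonSet
kubectl get daemonset gke-metadata-server -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'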


@ethansq  - I don't think the `dockercfg` issue is related to the logs ... unless it's what is causing pods to be delayed coming back up.

In case anyone is still having issues with this, the fix for us was to set `google-logging-use-fluentbit=true` in Compute Engine - Metadata.
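
For anyone who prefers to script that rather than use the console, a sketch (assuming project-level metadata is what was meant; it can also be set per instance with gcloud compute instances add-metadata):

gcloud compute project-info add-metadata \
    --project=<project_id> \
    --metadata=google-logging-use-fluentbit=true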

After debugging for some time, I found that the above metadata-server logs only appear for pods that run on the host network (i.e. hostNetwork: true).
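
To verify that on your own cluster, a quick way to list which pods are on the host network (a sketch, assuming jq is installed):

# list all pods running with hostNetwork: true
kubectl get pods -A -o json \
  | jq -r '.items[] | select(.spec.hostNetwork == true) | .metadata.namespace + "/" + .metadata.name'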

Any update on this issue?
