
DNS response routing failure in GKE multi-network cluster

Hi,

I'm facing an issue with a GKE multi-network cluster where a DNS query from a dnsutils pod fails with a timeout.

I've attached a diagram below for better understanding:

 

[Diagram: barakota_0-1726059683423.png]

 

 

From the traces, it looks like the host is unable to route DNS responses back to the kube-dns pod. Asymmetric routing could be one potential reason for this failure.

(1) Execute "dig cloudflare.com." from the dnsutils pod
root@dnsutils:/# dig cloudflare.com.
; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> cloudflare.com.
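A check along these lines should separate the pod -> kube-dns leg from the kube-dns -> metadata-server leg (sketch only; 10.10.0.10 is the kube-dns pod IP from the traces further down, and any cluster-internal Service name can stand in for kubernetes.default):

# From the dnsutils pod: a cluster-internal name is answered by kube-dns itself
# (no upstream needed), while an external name forces the forward to 169.254.169.254.
# If the first succeeds and the second times out, the break is on the upstream leg.
dig @10.10.0.10 kubernetes.default.svc.cluster.local. +time=2 +tries=1
dig @10.10.0.10 cloudflare.com. +time=2 +tries=1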


(2) tcpdump on kube-dns eth0 showing:
a) the DNS request packet received from the dnsutils pod (10.10.0.16)
b) a new DNS request packet sent out, destined to 169.254.169.254

 

[Screenshot: barakota_1-1726059683425.png — tcpdump on kube-dns eth0]

 

The nameserver comes from the kube-dns pod's resolv.conf:

kube-dns-86df8454bc-jg2dp:~# cat /etc/resolv.conf
search europe-central2-a.c.civic-sunrise-419116.internal c.civic-sunrise-419116.internal google.internal
nameserver 169.254.169.254
nameserver 169.254.169.254
nameserver 169.254.169.254
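To take the pod network out of the picture entirely, the same upstream could be queried from a shell on the node itself (dig isn't on Container-Optimized OS by default, so this assumes a toolbox container or a hostNetwork debug pod). Traffic sourced by the node's primary NIC should be routed symmetrically:

# On the node: query the metadata DNS server directly; if this works while the
# pod-originated query fails, the problem is in how forwarded pod traffic is routed.
dig @169.254.169.254 cloudflare.com. +time=2 +tries=1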

 

(3) tcpdump on the veth interface gke2c74913f72d showing the same packets forwarded across the veth pair (gke2c74913f72d <-> kube-dns eth0)

[Screenshot: barakota_2-1726059683427.png — tcpdump on veth gke2c74913f72d]

 

The packets are forwarded toward the veth pair because it is the default gateway, according to the routing config in the kube-dns pod:
kube-dns-86df8454bc-jg2dp:~# ip r
default via 10.10.0.1 dev eth0 mtu 1460
10.10.0.0/26 via 10.10.0.1 dev eth0 src 10.10.0.10 mtu 1460
10.10.0.1 dev eth0 scope link src 10.10.0.10 mtu 1460
kube-dns-86df8454bc-jg2dp:~#
kube-dns-86df8454bc-jg2dp:~# arp
? (10.10.0.1) at 2a:db:1b:85:2a:61 [ether] on eth0
kube-dns-86df8454bc-jg2dp:~#
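For completeness, the pod's own routing decision for the upstream can be confirmed with ip route get (iproute2 is clearly present in the image, given the outputs above); the expected answer is the default route via 10.10.0.1 on eth0:

# Inside the kube-dns pod: which route is used to reach the metadata server?
ip route get 169.254.169.254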

 

(4) tcpdump showing DNS query packets egressing the host via interface ens6, destined to 169.254.169.254

[Screenshot: barakota_3-1726059683428.png — tcpdump on host interface ens6]

 

The routing decision here is based on the host route table:

root@gke-gke-cluster-node-pool-baea6134-2wv5:~# ip r
169.254.169.254 dev ens6 proto dhcp scope link src 192.168.4.2 metric 100
169.254.169.254 via 192.168.10.1 dev ens4 proto dhcp src 192.168.10.5 metric 100
169.254.169.254 dev ens5 proto dhcp scope link src 192.168.1.2 metric 100

root@gke-gke-cluster-node-pool-baea6134-2wv5:~# arp
Address HWtype HWaddress Flags Mask Iface
metadata.google.interna ether 42:01:c0:a8:01:01 C ens5
metadata.google.interna ether 42:01:c0:a8:04:01 C ens6
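With three routes to 169.254.169.254 at the same metric, it's worth asking the kernel directly which one it picks, both for traffic generated by the node itself and for the forwarded pod traffic (second form; source IP and ingress interface taken from the outputs above):

# Route chosen for traffic originated by the node
ip route get 169.254.169.254
# Route chosen for traffic forwarded from the kube-dns pod
ip route get 169.254.169.254 from 10.10.0.10 iif gke2c74913f72d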


(5) tcpdump showing DNS response packets ingressing via host interface ens4, destined to kube-dns (10.10.0.10)

 

[Screenshot: barakota_4-1726059683429.png — tcpdump on host interface ens4]

 

 

These packets do not reach the veth interface gke2c74913f72d, despite the routing configuration on the host:

root@gke-cp-gke-cluster-node-pool-5gc-baea6134-2wv5:~# ip r
10.10.0.10 dev gke2c74913f72d scope link

root@gke-cp-gke-cluster-node-pool-5gc-baea6134-2wv5:~# arp
Address HWtype HWaddress Flags Mask Iface
10.10.0.10 ether 4e:e0:55:25:a6:5c C gke2c74913f72d
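To narrow down where the responses actually die, a capture on the veth plus the kernel drop counters should help (sketch; assumes the usual iproute2/iptables tooling is available on the node):

# Do the responses ever reach the veth?
tcpdump -ni gke2c74913f72d udp port 53
# Kernel-level drops: reverse-path-filter discards and no-route discards
nstat -az | grep -Ei 'IPReversePathFilter|InNoRoutes'
# Netfilter drops on the forward path
iptables -L FORWARD -v -n | grep -iE 'drop|reject'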


(6) The DNS query fails with a timeout
root@dnsutils:/# dig cloudflare.com.
; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> cloudflare.com.
;; global options: +cmd
;; connection timed out; no servers could be reached
root@dnsutils:/#

 

As can be seen, the DNS query and response packets are routed asymmetrically (query out via ens6, response back via ens4). Could this be the issue? Although the DNS response arriving on ens4 is destined to the correct IP and port (10.10.0.10:52082), I'm not certain that asymmetric routing is the root cause.
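One mechanism that would turn this asymmetry into silent drops is strict reverse-path filtering: a reply arriving on ens4 is discarded if the kernel's route back to its source (169.254.169.254) points out a different interface, e.g. ens6. A check along these lines (the effective value per interface is the maximum of the "all" and per-interface settings):

# 0 = off, 1 = strict, 2 = loose
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.ens4.rp_filter net.ipv4.conf.ens6.rp_filter
# As a test only, loose mode on the ingress NIC would let the replies through
# if rp_filter is what is dropping them:
# sysctl -w net.ipv4.conf.ens4.rp_filter=2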

The ip a, ip r, and arp outputs from the pods and the host, as well as the host iptables rules and the tc filter config for the interfaces, are also available on request.

Thanks in advance,
Br,
Amr

 

 


Hi @barakota,

Welcome to Google Cloud Community!

It appears that your project needs a more thorough investigation. For more detailed insights, you may want to reach out to Google Cloud Support; their team has specialized expertise in diagnosing underlying problems. When contacting them, include all of your details and screenshots so they can better understand and address your issue.

I hope this helps!
