My pod cannot return UDP responses larger than 1432 bytes when it is accessed through the load balancer, but it responds fine to requests made from within the cluster.
To narrow the problem down, I created a simple server and client: the server listens on UDP port 61000 and sends the received data back to the client doubled.
When I send 717-byte requests from the client pod, the server's response carries 1434 bytes of UDP payload, just 2 bytes more than the 1432 bytes that fit in a single packet at our VPC MTU of 1460 (1460 minus 20 bytes of IP header minus 8 bytes of UDP header), so the response has to be fragmented. Accessed this way, from inside the cluster, there are no issues.
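For reference, the server and client are roughly equivalent to this (a simplified sketch, not the exact code; the port and payload size match the test described above):

```python
# server.py - listens on UDP 61000 and replies with the request payload doubled
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 61000))

while True:
    data, addr = sock.recvfrom(65535)
    # A 717-byte request produces a 1434-byte reply, which no longer fits in
    # one packet at MTU 1460 (1460 - 20 IP header - 8 UDP header = 1432),
    # so the reply is IP-fragmented.
    sock.sendto(data * 2, addr)
```

```python
# client.py - sends a 717-byte request and waits for the doubled reply
import socket
import sys

# First argument: the pod IP (inside the cluster) or the load balancer IP (outside)
target = (sys.argv[1], 61000)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(5)
sock.sendto(b"x" * 717, target)
try:
    reply, _ = sock.recvfrom(65535)
    print(f"received {len(reply)} bytes")
except socket.timeout:
    print("no reply (timed out)")
```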
However, when I make the same requests through the load balancer from outside the cluster, the response occasionally comes back, but most of the time it does not.
I suspect this issue may be related to MTU and IP fragmentation but cannot find a solution. Any assistance would be greatly appreciated!
When accessing through the load balancer, I captured packets on the server side and found that the response packets were correctly fragmented and sent, yet they never reached the client. The capture looks no different from the one taken when accessing from within the cluster. Below are Wireshark screenshots of the two accesses, one from inside the cluster and one from outside:
Hi @archean ,
Is Dataplane V2 enabled on this cluster? If so, could you run the following commands?
To verify whether this is the case, it would be good to run the following command in the Cilium agent pod on the node that is actively seeing the failures:
`cilium monitor -v`
If that produces too much output to parse, it may be better to run the command below instead.
`cilium monitor -t drop`
The above commands will let us confirm whether the DPv2 layer is dropping the packets. Please run them in the Cilium agent pod on the node where the drops are observed and share the logs.
Hi Marvin,
Thank you very much for your response. After running `cilium monitor -t drop`, I ran some tests against the load balancer IP and got the following result:
xx drop (First logical datagram fragment not found) flow 0x0 to endpoint 0, , identity world->unknown: 207.65.235.217 -> 34.85.12.159
I also found this information in earlier Hubble observations:
Mar 12 14:54:41.319: 10.0.0.5:59003 (remote-node) -> default/server-6b67bd6ff6-x6vgj:61000 (ID:16516) to-endpoint FORWARDED (UDP)
Mar 12 14:54:41.319: 10.0.0.5 (remote-node) <- default/server-6b67bd6ff6-x6vgj (ID:16516) to-stack FORWARDED (IPv4)
Mar 12 14:54:41.320: default/server-6b67bd6ff6-x6vgj (world) <> 10.0.0.5 (host) First logical datagram fragment not found DROPPED (IPv4)
I am running a GKE cluster with Dataplane V2 enabled. Based on my investigation, I suspect the data is being dropped in Cilium. Could you please help me investigate this further? I am happy to provide additional information if needed.