Hi
We have an HTTPS load balancer with a public IP address, with 13 backends behind it. A few hundred devices maintain live gRPC streaming connections to it 24/7. About once a minute, each device also sends a unary gRPC request to one of the services behind the load balancer. Occasionally (roughly once per hour), such a request fails with an error like "dial tcp <IP> io timeout", where the IP is the public IP of the load balancer. Either the route to the destination is failing (for example, a route disappearing), or the load balancer receives the TCP handshake but does not respond. However, on the next try the same device can establish a connection and submit the request successfully. We have been trying to find out why the load balancer sometimes does not respond to a TCP connection attempt. We discovered some issues:
* Each device was accumulating connections, and we had about 250 per device. With 300 devices, we could be at around 100K connections to the load balancer. We are going to fix the missing "close" call, but I still have some questions I don't know how to answer: Is there a limit on the number of TCP connections the load balancer can maintain with downstream clients? How would it fare if we had 100K devices with a couple of connections each? I did not find any documented limits, but maybe there are some? If there is a limit, how can we monitor the number of active TCP connections (from the internet to the load balancer)? I looked at the load balancer metrics, but I did not find the number of active connections or any error rate. I could only find some counters for HTTP requests, but nothing at the lower TCP level.
Furthermore, we captured the TCP traffic from one client to the load balancer, and the SYN-ACK packets seem to be missing:
11:20:27.067694 eth0 Out IP 192.168.1.3.42430 > 34.110.188.250.443: Flags [S], seq 1224930626, win 64240, options [mss 1460,sackOK,TS val 3856964254 ecr 0,nop,wscale 7], length 0
11:20:27.286058 eth0 Out IP 192.168.1.3.55712 > 34.110.188.250.443: Flags [S], seq 4214507581, win 64240, options [mss 1460,sackOK,TS val 3856964472 ecr 0,nop,wscale 7], length 0
11:20:28.311710 eth0 Out IP 192.168.1.3.55712 > 34.110.188.250.443: Flags [S], seq 4214507581, win 64240, options [mss 1460,sackOK,TS val 3856965498 ecr 0,nop,wscale 7], length 0
11:20:28.812791 eth0 Out IP 192.168.1.3.55716 > 34.110.188.250.443: Flags [S], seq 1942981962, win 64240, options [mss 1460,sackOK,TS val 3856965999 ecr 0,nop,wscale 7], length 0
11:20:28.891702 eth0 Out IP 192.168.1.3.49576 > 34.110.188.250.443: Flags [S], seq 1242130369, win 64240, options [mss 1460,sackOK,TS val 3856966078 ecr 0,nop,wscale 7], length 0
11:20:29.815707 eth0 Out IP 192.168.1.3.55716 > 34.110.188.250.443: Flags [S], seq 1942981962, win 64240, options [mss 1460,sackOK,TS val 3856967002 ecr 0,nop,wscale 7], length 0
11:20:30.135687 eth0 Out IP 192.168.1.3.42700 > 34.110.188.250.443: Flags [S], seq 697413040, win 64240, options [mss 1460,sackOK,TS val 3856967322 ecr 0,nop,wscale 7], length 0
11:20:30.327695 eth0 Out IP 192.168.1.3.55712 > 34.110.188.250.443: Flags [S], seq 4214507581, win 64240, options [mss 1460,sackOK,TS val 3856967514 ecr 0,nop,wscale 7], length 0
11:20:30.395680 eth0 Out IP 192.168.1.3.42712 > 34.110.188.250.443: Flags [S], seq 4039898438, win 64240, options [mss 1460,sackOK,TS val 3856967582 ecr 0,nop,wscale 7], length 0
11:20:31.748275 eth0 Out IP 192.168.1.3.55724 > 34.110.188.250.443: Flags [S], seq 515854204, win 64240, options [mss 1460,sackOK,TS val 3856968934 ecr 0,nop,wscale 7], length 0
11:20:31.831702 eth0 Out IP 192.168.1.3.55716 > 34.110.188.250.443: Flags [S], seq 1942981962, win 64240, options [mss 1460,sackOK,TS val 3856969018 ecr 0,nop,wscale 7], length 0
11:20:32.759694 eth0 Out IP 192.168.1.3.55724 > 34.110.188.250.443: Flags [S], seq 515854204, win 64240, options [mss 1460,sackOK,TS val 3856969946 ecr 0,nop,wscale 7], length 0
11:20:32.951724 eth0 Out IP 192.168.1.3.49576 > 34.110.188.250.443: Flags [S], seq 1242130369, win 64240, options [mss 1460,sackOK,TS val 3856970138 ecr 0,nop,wscale 7], length 0
11:20:33.207711 eth0 Out IP 192.168.1.3.33992 > 34.110.188.250.443: Flags [S], seq 584998346, win 64240, options [mss 1460,sackOK,TS val 3856970394 ecr 0,nop,wscale 7], length 0
11:20:33.207814 eth0 Out IP 192.168.1.3.35282 > 34.110.188.250.443: Flags [S], seq 3837891136, win 64240, options [mss 1460,sackOK,TS val 3856970394 ecr 0,nop,wscale 7], length 0
11:20:34.487715 eth0 Out IP 192.168.1.3.55712 > 34.110.188.250.443: Flags [S], seq 4214507581, win 64240, options [mss 1460,sackOK,TS val 3856971674 ecr 0,nop,wscale 7], length 0
11:20:34.775699 eth0 Out IP 192.168.1.3.55724 > 34.110.188.250.443: Flags [S], seq 515854204, win 64240, options [mss 1460,sackOK,TS val 3856971962 ecr 0,nop,wscale 7], length 0
11:20:35.511697 eth0 Out IP 192.168.1.3.35324 > 34.110.188.250.443: Flags [S], seq 3555126586, win 64240, options [mss 1460,sackOK,TS val 3856972698 ecr 0,nop,wscale 7], length 0
11:20:36.023710 eth0 Out IP 192.168.1.3.55716 > 34.110.188.250.443: Flags [S], seq 1942981962, win 64240, options [mss 1460,sackOK,TS val 3856973210 ecr 0,nop,wscale 7], length 0
11:20:38.839701 eth0 Out IP 192.168.1.3.55724 > 34.110.188.250.443: Flags [S], seq 515854204, win 64240, options [mss 1460,sackOK,TS val 3856976026 ecr 0,nop,wscale 7], length 0
11:20:39.351710 eth0 Out IP 192.168.1.3.34014 > 34.110.188.250.443: Flags [S], seq 809291829, win 64240, options [mss 1460,sackOK,TS val 3856976538 ecr 0,nop,wscale 7], length 0
If the SYN-ACK is missing, how do we diagnose this? Thanks
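A capture like the one above can be summarized per connection attempt to see which SYNs were retransmitted and never answered. The following is only a rough sketch: the function name `syn_flows` and the regular expression are illustrative, the parser assumes the tcpdump text format shown above, and it assumes the capture was not filtered to outgoing packets only (otherwise every flow will look unanswered). It also ignores source-port reuse within one capture.

```python
import re
from collections import defaultdict

# Matches tcpdump lines in the format shown in the capture above.
LINE_RE = re.compile(
    r"^(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}\.\d+)\s+\S+\s+(?:In|Out)\s+IP\s+"
    r"(?P<src>\S+)\s+>\s+(?P<dst>\S+):\s+Flags\s+\[(?P<flags>[^\]]+)\]"
)

# Three lines copied from the capture above (two different source ports).
SAMPLE = """\
11:20:27.067694 eth0 Out IP 192.168.1.3.42430 > 34.110.188.250.443: Flags [S], seq 1224930626, win 64240, length 0
11:20:27.286058 eth0 Out IP 192.168.1.3.55712 > 34.110.188.250.443: Flags [S], seq 4214507581, win 64240, length 0
11:20:28.311710 eth0 Out IP 192.168.1.3.55712 > 34.110.188.250.443: Flags [S], seq 4214507581, win 64240, length 0
"""


def syn_flows(capture: str) -> dict:
    """Group SYN packets by (client, server) and note whether a SYN-ACK was seen."""
    flows = defaultdict(
        lambda: {"syns": 0, "first": None, "last": None, "answered": False}
    )
    for line in capture.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a packet line in the expected format
        t = int(m["h"]) * 3600 + int(m["m"]) * 60 + float(m["s"])
        if m["flags"] == "S":  # SYN from client to server
            flow = flows[(m["src"], m["dst"])]
            flow["syns"] += 1
            if flow["first"] is None:
                flow["first"] = t
            flow["last"] = t
        elif m["flags"] == "S.":  # SYN-ACK from server back to client
            flows[(m["dst"], m["src"])]["answered"] = True
    return dict(flows)


if __name__ == "__main__":
    for (client, server), flow in syn_flows(SAMPLE).items():
        if not flow["answered"]:
            waited = flow["last"] - flow["first"]
            print(f"{client} -> {server}: {flow['syns']} SYN(s), "
                  f"no SYN-ACK, {waited:.3f}s between first and last SYN")
```

Flows that show several SYNs with the same source port and no SYN-ACK are retransmissions of a single connection attempt, which helps separate "one attempt retried by the kernel" from "many independent attempts".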
Hi @aceligowska ,
There is no specific limit on the number of TCP connections that a load balancer can handle. However, the actual capacity and performance of the load balancer depend on factors such as the type of load balancer, its configuration, and the resources allocated to it.
You can monitor the number of active TCP connections by following these steps:
If you are observing missing TCP SYN-ACK packets, it could be indicative of an issue in the TCP handshake process between the client and the load balancer. You may check the following:
1. Check the health of the backend servers. If the backend servers are not responding or are experiencing issues, the load balancer may not receive the expected SYN-ACK responses.
2. Verify the configuration of your load balancer. Make sure that it is correctly configured to handle incoming TCP connections and is forwarding traffic to the appropriate backend servers.
3. I recommend consulting with GCP support if needed.
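To gather evidence before opening a support case, it can help to measure the TCP handshake on its own, independent of the gRPC client. The sketch below is a hypothetical probe, not a GCP tool: the names `probe` and `watch`, the 3-second timeout, and the interval are assumptions chosen for illustration. It uses only the Python standard library and records how long each connect attempt takes and whether it fails.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host: str, port: int, timeout: float = 3.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # create_connection completes the TCP three-way handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # includes timeouts and refused connections
        return False, time.monotonic() - start, str(exc)


def watch(host: str, port: int, attempts: int, interval: float = 5.0) -> int:
    """Probe repeatedly, print failures with a UTC timestamp, return failure count."""
    failures = 0
    for _ in range(attempts):
        ok, elapsed, err = probe(host, port)
        if not ok:
            failures += 1
            stamp = datetime.now(timezone.utc).isoformat()
            print(f"{stamp} connect to {host}:{port} failed "
                  f"after {elapsed:.3f}s: {err}")
        time.sleep(interval)
    return failures


# Example call (address taken from the capture in the question):
# watch("34.110.188.250", 443, attempts=720, interval=5.0)
```

Running such a probe from an affected device and from a second network at the same time may indicate whether the failures follow the client's network path or the load balancer's address; the timestamps of failures can then be included in a support request.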