[x-post from (URL Removed by Staff)]
I have been using Cloud Batch and discovered an issue that seems like it should be more widely known, or already handled by Google. I have been working with the GCP support team on this for many weeks, but I also want to check with the community: have people simply not noticed this issue, or is it actually a minor one that does not warrant attention?
On a VM with a COS or Debian disk image, no external IP, and a network configured with Cloud Router and Cloud NAT, if I try to connect to a remote server (it can be any server outside the Google network) in the startup script, the first connection takes 20-40s. If, in my startup script, I just sleep for 40s first and then try to connect, the connection is immediate. The command I used to test was `curl -v --trace-time https://microsoft.com`.
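To be concrete, the startup script for the test is essentially just the curl command above; the sketch below shows the two variants I compared (the 40s sleep is the only difference):

```
#!/bin/bash
# Variant A: connect immediately after boot -- the first connection
# takes 20-40s in my setup.
curl -v --trace-time https://microsoft.com

# Variant B: sleep first, then connect -- the connection is immediate.
# sleep 40
# curl -v --trace-time https://microsoft.com
```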
This would normally not be an issue beyond the increased startup latency, but several applications (like the Docker daemon) have shorter connection timeouts, and steps invoking them in the startup script will always fail unless they are wrapped in a retry loop or their connection timeout is somehow increased.
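For illustration, the kind of retry loop I mean looks roughly like this (a sketch, not the exact script I run):

```
#!/bin/bash
# Wait for outbound connectivity through Cloud NAT before running steps
# (e.g. docker pulls) that have their own, shorter connection timeouts.
for attempt in $(seq 1 30); do
  if curl -sf --connect-timeout 5 https://microsoft.com -o /dev/null; then
    echo "Outbound connectivity is up after ${attempt} attempt(s)."
    break
  fi
  echo "Outbound connectivity not ready yet (attempt ${attempt}), retrying..."
  sleep 2
done
```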
I would imagine that an extra 40s of network bootstrap delay is a serious enough issue for Google to look into, or at least acknowledge more publicly.
Dear siddharthab,
I tried to do the same thing as you described, but mine works immediately. Can you confirm this is what you meant?
My VM booted at:
```
2024-08-10T01:15:55.968892+00:00 localhost kernel: [ 0.000000] Linux version 6.1.0-22-cloud-amd64 (<PII removed by staff>) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21)
```
My startup script started at:
```
2024-08-10T01:16:02.653808+00:00 instance-20240810-011435 systemd[1]: Starting google-startup-scripts.service - Google Compute Engine Startup Scripts...
```
The curl request was invoked at:
```
2024-08-10T01:16:02.941644+00:00 instance-20240810-011435 google_metadata_script_runner[1798]: startup-script: #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 001:16:02.044393 * Trying 20.112.250.133:443...
```
The HTTP 301 was received at:
```
2024-08-10T01:16:03.481354+00:00 instance-20240810-011435 google_metadata_script_runner[1798]: startup-script: 01:16:02.583901 < HTTP/2 301
```
The curl command finished at:
```
2024-08-10T01:16:03.482145+00:00 instance-20240810-011435 google_metadata_script_runner[1798]: startup-script: 01:16:02.584685 * Connection #0 to host microsoft.com left intact
```
My question is: can you share your Cloud NAT settings? Something like this would suffice.
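If it is easier, the output of something like this would also work (the NAT, router, and region names are placeholders to be replaced with yours):

```
# Dump the Cloud NAT configuration for the router in question
# (replace the placeholder NAT/router names and the region).
gcloud compute routers nats describe NAT_NAME \
    --router=ROUTER_NAME \
    --region=REGION
```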
Regards,
Izza
Hi Izza,
Thank you for looking into this. Attached are the details you requested. The only major differences I see between your settings and mine are the region and minimum number of ports per VM.
I tried the command again and I can still reproduce the delay. For me, the relevant log lines I see in journalctl are:
```
Aug 13 04:08:27 sidb-test-2 google_metadata_script_runner[1794]: startup-script: 04:08:27.526668 * Trying [2603:1030:b:3::152]:443...
Aug 13 04:08:27 sidb-test-2 google_metadata_script_runner[1794]: startup-script: 04:08:27.526803 * Immediate connect fail for 2603:1030:b:3::152: Network is unreachable
Aug 13 04:08:28 sidb-test-2 oslogin_cache_refresh[1840]: Refreshing group entry cache
Aug 13 04:08:28 sidb-test-2 google_metadata_script_runner[1794]: startup-script: 04:08:28.528213 * Trying [2603:1030:20e:3::23c]:443...
Aug 13 04:08:28 sidb-test-2 google_metadata_script_runner[1794]: startup-script: 04:08:28.528362 * Immediate connect fail for 2603:1030:20e:3::23c: Network is unreachable
Aug 13 04:08:29 sidb-test-2 google_metadata_script_runner[1794]: [150B blob data]
Aug 13 04:08:29 sidb-test-2 google_metadata_script_runner[1794]: startup-script: 04:08:29.529597 * Immediate connect fail for 2603:1030:c02:8::14: Network is unreachable
Aug 13 04:08:30 sidb-test-2 google_metadata_script_runner[1794]: [152B blob data]
Aug 13 04:08:30 sidb-test-2 google_metadata_script_runner[1794]: startup-script: 04:08:30.530191 * Immediate connect fail for 2603:1020:201:10::10f: Network is unreachable
Aug 13 04:08:31 sidb-test-2 google_metadata_script_runner[1794]: [148B blob data]
Aug 13 04:08:31 sidb-test-2 google_metadata_script_runner[1794]: startup-script: 04:08:31.531782 * Immediate connect fail for 2603:1010:3:3::5b: Network is unreachable
Aug 13 04:08:43 sidb-test-2 google_metadata_script_runner[1794]: [1.0K blob data]
Aug 13 04:08:43 sidb-test-2 google_metadata_script_runner[1794]: startup-script: 04:08:42.795264 * ALPN: offers h2,http/1.1
```
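(Side note: the first few attempts above are curl trying IPv6 addresses, which fail immediately with "Network is unreachable" before it falls back to IPv4. If it helps keep the trace clean, the test could also be restricted to IPv4, e.g.:)

```
# Force IPv4 so the IPv6 "Network is unreachable" attempts
# do not clutter the trace.
curl -4 -v --trace-time https://microsoft.com
```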
Dear siddharthab,
Try increasing the minimum ports per VM instance (higher than 200 is recommended), then try to reproduce the issue again and see whether anything changes.
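For example, with gcloud it can be changed like this (placeholder NAT/router names and region; 512 is just an example value above 200):

```
# Raise the minimum number of NAT ports reserved per VM
# (placeholder NAT/router names and region; 512 is an example value).
gcloud compute routers nats update NAT_NAME \
    --router=ROUTER_NAME \
    --region=REGION \
    --min-ports-per-vm=512
```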
Regards,
Izza
Hi Izza,
I can still see the issue after increasing the minimum number of ports to 500. So perhaps this issue is specific to the us-central1 region.
I ran the test 7 times and I was able to see the delay 5 out of the 7 times. For example:
```
Aug 14 09:24:22 sidb-test-20240814-092336 google_metadata_script_runner[1793]: Found startup-script in metadata.
Aug 14 09:24:22 sidb-test-20240814-092336 google_metadata_script_runner[1793]: startup-script: % Total % Received % Xferd Average Speed Time Time Time Current
Aug 14 09:24:22 sidb-test-20240814-092336 google_metadata_script_runner[1793]: startup-script: Dload Upload Total Spent Left Speed
Aug 14 09:24:22 sidb-test-20240814-092336 google_metadata_script_runner[1793]: [142B blob data]
Aug 14 09:24:54 sidb-test-20240814-092336 google_metadata_script_runner[1793]: [2.4K blob data]
Aug 14 09:24:54 sidb-test-20240814-092336 google_metadata_script_runner[1793]: startup-script: 09:24:53.938251 > GET / HTTP/1.1
```
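For anyone who wants to try this, a test instance along these lines should be enough to reproduce the setup described above (names, zone, and machine type are placeholders; the key parts are no external IP and the curl test in startup.sh):

```
# Create a no-external-IP test VM that runs the curl test at boot.
# Network/subnet names are placeholders and must be behind Cloud NAT.
gcloud compute instances create sidb-test-$(date +%Y%m%d-%H%M%S) \
    --zone=us-central1-a \
    --machine-type=e2-medium \
    --image-family=debian-12 \
    --image-project=debian-cloud \
    --network=NETWORK_NAME \
    --subnet=SUBNET_NAME \
    --no-address \
    --metadata-from-file=startup-script=startup.sh
```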
Dear siddharthab,
What machine type did you use?
Regards,
Izza
I have seen this issue with e2 and n2 machine types, with both low and high CPU counts.
Is there a specific machine type you would like me to try?
Not sure if this helps, but I seem to be experiencing something similar (in europe-west4) with serverless (Cloud Run), also behind VPC + NAT. Sometimes the first outbound connection is really fast, and sometimes there is a 15-20 second delay. I've tried everything imaginable, but it seems the bootstrapping of either the VPC or the NAT is really slow. I can't wrap my head around it.
PS: What might help with debugging is adding an "Allow all egress" firewall rule with logging enabled (rough sketch below) -- it lets you see whether there is latency between the firewall seeing the request, the NAT allocating the port, and the request actually being handled (in your VM).
For reference; https://www.googlecloudcommunity.com/gc/Serverless/Long-delay-on-initial-outbound-connection-with-Ge...
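The "Allow all egress" rule with logging that I mentioned is roughly this (network name and priority are placeholders):

```
# Temporary catch-all egress rule with logging enabled, to see when the
# firewall first observes the outbound request (placeholder network name).
gcloud compute firewall-rules create debug-allow-all-egress \
    --network=NETWORK_NAME \
    --direction=EGRESS \
    --action=ALLOW \
    --rules=all \
    --destination-ranges=0.0.0.0/0 \
    --priority=1000 \
    --enable-logging
```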