We are considering moving from Azure Container Apps to GCP Cloud Run because of the much better "scale to 0" boot times. We are running benchmarks, but are hitting the following issue:
When we deploy our app in the Gen1 environment (we need fast cold boots) using Direct VPC Egress to a Shared VPC, with a Cloud NAT on the Shared VPC, there is a significant delay of about 10-15 seconds on the first outbound HTTP(S) connection.
We see in the logs that the firewall is hit only ~8s after the request on average, and the Cloud NAT allocation is done only after ~10s on average. We've made sure there are enough allocated ports, and we've tried both Standard and Premium network tier IPs. What's more, it seems to be intermittent: sometimes we do get a ~1s cold boot and an immediate outbound HTTPS connection.
When we remove Direct VPC Egress from the Cloud Run instance, the issue is resolved and outbound HTTPS connections are available within 1 second; but we need Cloud NAT for IP whitelisting.
This issue does not seem to exist on Gen2, but that of course leads to slower cold boots in general. Edit: upon further investigation, this also happens on Gen2.
Is there a known issue with Cloud Run instances and Direct VPC Egress delays during cold boot?
Edit: nice illustration of what I'm seeing -- a 12 second delay between the firewall seeing the egress request and the NAT allocation.
I found this topic from last Friday which seems to have similar issues, just using VMs: https://www.googlecloudcommunity.com/gc/Infrastructure-Compute-Storage/VMs-with-private-IPs-and-NAT-...
For the Cloud Run side - do you have CPU Boost enabled? See Faster cold starts with startup CPU Boost | Google Cloud Blog and 3 Ways to optimize Cloud Run response times | Google Cloud Blog.
For the VPC side, see Direct VPC egress with a VPC network | Cloud Run Documentation | Google Cloud.
Are you using Network tags for the firewall?
How large is the subnet (it sounds weird but it has to do with the underlying host created) ?
How did you set the "Traffic routing" option (route only requests to private IPs to the VPC, or route all traffic to the VPC)?
Did you set up anything in Network Intelligence Center to test egress? (It would point to a route or service-networking configuration issue.)
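To make the settings I'm asking about concrete, here's a rough deploy sketch with CPU Boost, Direct VPC egress into a Shared VPC subnet, a network tag, and all-traffic routing. Service name, image, project, region and subnet below are placeholders, not taken from your setup, and you should double-check the flag names against the current gcloud reference:

# Hypothetical deploy; for a Shared VPC the network/subnet usually need the full resource path
gcloud run deploy my-service \
  --image=REGION-docker.pkg.dev/PROJECT/repo/image:tag \
  --region=REGION \
  --cpu-boost \
  --vpc-egress=all-traffic \
  --network=projects/HOST_PROJECT/global/networks/SHARED_VPC \
  --subnet=projects/HOST_PROJECT/regions/REGION/subnetworks/CLOUD_RUN_SUBNET \
  --network-tags=test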
Based on the above, if there are no firewall or config issues, you can look at the router with the mapping (gcloud compute routers get-nat-mapping-info | Google Cloud CLI Documentation) and potentially adjust the NAT rules (gcloud compute routers nats rules create | Google Cloud CLI Documentation) to adjust the priority. Another way is to use the NGFW with an external IP to whitelist the service, or to put an Application Load Balancer in front and use its WAF to whitelist (understanding that the latter comes at an additional cost).
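As a rough sketch of those two commands - router, NAT gateway, region, rule number, match expression and IP are placeholders, so check the CLI reference for the exact flags:

# Inspect which NAT ports each Direct VPC egress interface currently holds
gcloud compute routers get-nat-mapping-info NAT_ROUTER --region=REGION

# Add a NAT rule (rule number, match expression and IP are illustrative only)
gcloud compute routers nats rules create 100 \
  --router=NAT_ROUTER \
  --region=REGION \
  --nat=NAT_GATEWAY \
  --match='inIpRange(destination.ip, "203.0.113.0/24")' \
  --source-nat-active-ips=RESERVED_STATIC_IP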
Here's the output:
---
instanceName: ''
interfaceNatMappings:
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1024-1151
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: ''
  sourceVirtualIp: 10.0.0.20
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1024-1151
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.21/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1024-1151
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.22/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1024-1151
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.23/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32768-32895
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: ''
  sourceVirtualIp: 10.0.0.20
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32768-32895
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.21/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32768-32895
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.22/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32768-32895
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.0.23/32
  sourceVirtualIp: ''
---
instanceName: ''
interfaceNatMappings:
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1152-1279
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: ''
  sourceVirtualIp: 10.0.8.16
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1152-1279
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.17/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1152-1279
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.18/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:1152-1279
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.19/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32896-33023
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: ''
  sourceVirtualIp: 10.0.8.16
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32896-33023
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.17/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32896-33023
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.18/32
  sourceVirtualIp: ''
- natIpPortRanges:
  - [EXTERNAL-STATIC-IP]:32896-33023
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 128
  sourceAliasIpRange: 10.0.8.19/32
  sourceVirtualIp: ''
The IPs match up with what I'm seeing being used in the Cloud NAT allocations.
Additionally, another option is to set up Secure Web Proxy with NAT, deploying it as a Private Service Connect (PSC) service attachment, as it has a higher priority: Deploy Secure Web Proxy as a Private Service Connect service attachment | Google Cloud
Sure, maybe that would work. But the recommended and documented way of getting a static outbound IP is using Direct VPC Egress and Cloud NAT -- it should "just work", but clearly there is a bug here... I just cannot get any actual support without paying, apparently.
I agree, and I haven't run into latency issues until I created them 🙂 - just providing another method I used to resolve it, based on the use case. Even if you had paid support, you'd be lucky to get assistance on a network latency issue.
Just noticed that Secure Web Proxy is a $1000/month service (or maybe not, but the GCP price calculators are missing most of the services, so who knows..), so that would kill the use case of Cloud Run for a simple service. Cannot believe this simple use case is so buggy, or at least undocumented.
In the one use case I'm using it for, it's about 60 GB/month and comes out to $4.50 per month. It's roughly $0.02/GB, so idk, ref the calculator.
Ah that's not too bad. I saw an instance hour cost of $1.25 somewhere.
Too bad it's also listed as not being supported with Direct VPC Egress under Limitations.
Yes, $1.25/hour for committed use according to the pricing page. As it is serverless, the committed use is low unless it is really chatty; 60 GB isn't a lot of traffic for the service I'm running, so I'm probably under the committed-use threshold.
So let me try to answer all your questions 😄
Yes, CPU boost is enabled.
No, the firewall has no rules related to tags. I do set the tag "test" or "prod" on the Cloud Run egress setting, but nothing is done with it in the firewall.
The subnet was /24, but because of your question I changed it to /21 to see what would happen (see the resize command sketch below): same issue.
Traffic routing is set to "Route all traffic to the VPC"
I did not use Network Intelligence -- but I don't think there is a generic issue with the routing, because after the delay the routing works just fine.
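For anyone else trying this: the resize itself can be done in place, something like the following (subnet name and region are placeholders; it only works if the larger range doesn't collide with other subnets):

# Grow an existing /24 to a /21 without recreating the subnet
gcloud compute networks subnets expand-ip-range CLOUD_RUN_SUBNET \
  --region=REGION \
  --prefix-length=21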
Sorry, the questions were to narrow down a number of scenarios. I would set up a connectivity test in Network Intelligence Center (Connectivity Tests overview | Google Cloud) with a client; also, I'd enable the Network Recommender API and use Network Analyzer (Use Network Analyzer | Google Cloud) to look at the node path and see if either the route or firewall traffic is causing the delay. I would pay attention to the RTT and SRT times for the session.
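As a sketch of what I mean - the source IP here is one of the NAT-mapped interface IPs from your output above, and the host project, network and destination values are placeholders:

# Connectivity test from a VPC-internal source IP to the external destination
gcloud network-management connectivity-tests create nat-egress-test \
  --source-ip-address=10.0.0.20 \
  --source-network=projects/HOST_PROJECT/global/networks/SHARED_VPC \
  --destination-ip-address=DESTINATION_IP \
  --destination-port=443 \
  --protocol=TCP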
Thanks! I did both of those, and there is no issue with a Connectivity Test from the VPC to the destination IP. The Network Analyzer has no recommendations either 🙂 And it seems the Network Topology and other tools from the Network Intelligence Center don't work for serverless workloads - it's just blank. So thanks for all the ideas! But still nothing 😞
Yes, the other tools don't work with serverless
Here is a picture that illustrates what I'm talking about: I've made an "egress-allow-all" firewall rule that just logs everything, and you can clearly see the 12 second delay between the firewall seeing the request and the NAT allocation -- even though the NAT mapping (as seen in one of the replies above) is already there for this source IP/port.
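For completeness, the rule is roughly this shape (the network name is a placeholder; it allows and logs all egress so the hits show up in the firewall logs):

# Allow-all egress rule with logging enabled, just to observe the timing
gcloud compute firewall-rules create egress-allow-all \
  --network=SHARED_VPC_NETWORK \
  --direction=EGRESS \
  --action=ALLOW \
  --rules=all \
  --destination-ranges=0.0.0.0/0 \
  --priority=65534 \
  --enable-logging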
Tried changing so many settings, but this freaking delay just keeps popping up. This is turning me off from the idea of using GCP at all, to be honest. It's such a shame that this wonderful 1s cold boot time is absolutely destroyed by a 15s wait on a NAT port. Could have been amazing, but I guess I'll just have to give AWS a go.
Just to rule it out - you're using a slim base image? Normally in a case like this - based on your statement about not having support - I'd say run GoogleCloudPlatform/vpc-network-tester: Deploy VPC Network Tester to App Engine or Cloud Run to inve... and open an issue on the repo, as it goes to the network program team, but it seems they stopped supporting it 😞 GoogleCloudPlatform/PerfKitBenchmarker: PerfKit Benchmarker (PKB) contains a set of benchmarks to me... might be another route: run the benchmark on Cloud Run and ask why the delay happens on the first response - their first question will be whether you're using a slim base image. With the latter you could make the point that AWS or Azure doesn't exhibit this issue.
If you did have support, you could open a case, but that would turn you off even more, as it takes time to get to the program team even if you give them everything up front.
Yeah, the image is not the issue. Maybe I can try the vpc-network-tester some day, but as you said: even if that proves my point, who is there to listen? I think I will just forget about this Cloud Run setup for now, go back to Azure, and maybe come back in a year to see if anything has improved.
Thanks for all the helpful tips, really appreciate it!
nw, at least the tester and PerfKit will provide a method to compare across platforms. If I run into a similar issue/use case - and raise a ticket to resolution - I'll comment back.
Thanks a lot! I actually found a cheap workaround for now: I run a Cloud Scheduler job that pings my service every 20 minutes -- it's not enough to keep the instance active, nor does it incur much cost, but apparently it's enough for the networking to remember/cache the configuration and not take 15s to set up the port binding etc. Pretty weird, but it seems to work.
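Roughly like this, in case it helps anyone (region, schedule and URL are placeholders; add an OIDC service account if the service requires authenticated invocations):

# Ping the service every 20 minutes to keep the egress path "warm"
gcloud scheduler jobs create http keep-egress-warm \
  --location=REGION \
  --schedule="*/20 * * * *" \
  --uri="https://SERVICE_URL/health" \
  --http-method=GET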
So then you could do that with a liveness probe as well, without needing Scheduler - just give it the route and period. Sorry I overlooked it - I typically have health and liveness probes, which might be why I haven't run into the delay - thanks for sharing.
Configure container health checks (services) | Cloud Run Documentation | Google Cloud
I already have a liveness probe (on the same endpoint) as well, but that only lives as long as the instance; so not long enough to mitigate the issue. I think I'm running into this because the service has extremely bursty traffic, but ah well.. maybe we will never know for sure 😄
@Terranca I was just wondering if you found some solution by chance.
I'm facing the exact same problem where Cloud NAT adds about 40+ seconds delay when we deploy the instance the first time. I saw above that you mentioned using the cloud scheduler to ping your service every X minutes.
Doing that works for the first instance but under some load testing I noticed that every time an additional cloud run instance is deployed, the users that are routed to that new instance also face the 40+ seconds delay (in my situation it's 40+ seconds for a simple python app using an outbound connection).
It seems like Direct VPC egress with a static outbound connection isn't working great.
Hi, no I never found the solution and nobody from Google ever responded, so it looks like we're stuck with this delay until a high paying Enterprise customer runs into this.
I would not expect the requester to wait for the second instance to spin up when you have liveness checks etc., as it could do the spin-up in the background, but maybe it's designed that way? Edit: yeah, it seems it is indeed designed that way.. that's a shame..
Hope you have more luck finding a proper solution.
@Terranca Yeah, it seems like we'll have to wait until Google fixes it indeed, that's very unfortunate.
Thank you so much for your answer!