GCP Compute instance *sporadically* not accessible from Starlink connection

wgenik · 12-19-2022 07:45 AM

Hello community.

I have a GCP compute instance (Ubuntu 16.04.7 LTS, GNU/Linux 4.15.0-142-generic x86_64) that hosts a live-facing HTTP/HTTPS webserver, APIs, etc., and it has been running well for the past ~5 years with minimal issues.

I am a developer and work remotely from home much of the time, and have recently switched to Starlink for my home connection. This past week, I've noticed that I'm suddenly being blocked sporadically from connecting to the instance on any port from my home connection, yet the site and all services (ssh, etc) are fully accessible from my office systems (verified via RDP). Yet occasionally I can access the instance again for several minutes at a time before it goes dark, seemingly randomly.

Here's what I've tried/confirmed so far:

During outage periods (as seen from home Starlink connection):

- Cannot ping instance from my main home machine or any other system on my home network, even though my (current dynamic, CG-NAT) IPv4 address is explicitly whitelisted/allowed on GCP VPC Network for ping and all ports

- Systems correctly identify GCP's static IP during DNS query of my several domains/subdomains attached to this instance, so it doesn't appear to be DNS (even though "it's always DNS!")

- Can't access HTTP/HTTPS/SSH/other ports that are explicitly opened and normally work fine for service access

- Tracert.exe shows successful hops all the way to and including "142.250.170.202" (AS15169 · Google LLC), which appears to be the last hop before my final destination IP address of GCP instance

- CAN access all said services/ping/tracert normally and without ANY interruption from my office network and from public internet

- CAN access my instance website via browser Proxy or IsUp.me check tools, etc, from my home connection

- Checked via ssh (from office through RDP) the Ubuntu iptables/ufw settings and don't see any entries that would be blocking, tried temporarily disabling ufw and no change in access from home connection

- Instance shows all good from GCP Console and all statuses are green etc.

Now, if I remote in via ssh (from office through RDP) and reboot the entire instance ("shutdown -r now"), then once the instance comes back online I have full access to it from my home Starlink connection:

- CAN ping, browse website, access all services on various ports

- Tracert.exe successfully goes all the way to final IP address (one hop further than when "blocked" for final destination)

However, after a short period of time (2-5 minutes, it seems), the instance is no longer accessible.

If I leave it for a long period and then try again, it might be fine for a few minutes again, but then it goes for long periods with no access (or I'm missing the small windows when it's accessible, for whatever reason).
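One way to pin down the timing described above, rather than relying on manual checks, would be a small probe that records every change in reachability. The following is only a rough sketch, not something from the original setup: the function names, the 10-second interval, and the address `203.0.113.10` (a documentation placeholder) are all assumptions.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable routes.
        return False


def state_changes(samples):
    """Collapse (timestamp, reachable) samples into the points where the state flips."""
    changes = []
    previous = None
    for stamp, reachable in samples:
        if reachable != previous:
            changes.append((stamp, reachable))
            previous = reachable
    return changes


def monitor(host: str, port: int, interval: float = 10.0, rounds: int = 60) -> None:
    """Probe repeatedly and print one line each time reachability changes."""
    previous = None
    for _ in range(rounds):
        reachable = probe(host, port)
        if reachable != previous:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            print(f"{stamp} {host}:{port} {'UP' if reachable else 'DOWN'}")
            previous = reachable
        time.sleep(interval)
```

Calling `monitor("203.0.113.10", 443)` with the instance's real address, from both the home and office networks at the same time, would show whether the UP/DOWN transitions line up with reboots or with anything in the instance's logs.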

It certainly seems to me that my GCP instance is what is doing the blocking (since the issue goes away temporarily if I reboot the instance) instead of it being an issue with Starlink or the DNS / routes in between my home connection and the instance.

What services / statuses / blacklist tables can I be checking on Ubuntu to see if my IP is being temporarily blocked? Is there some kind of a soft-block that is happening on the GCP VPC network that I need to look into? I think I've checked sshguard and don't see it auto-blocking me (and it happens even when I haven't tried to access ssh during the uptime)
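As a starting point for the host-side check, the output of `iptables -S` (run as root on the instance) can be scanned for DROP or REJECT rules whose source match covers a given client address. This is a minimal sketch under stated assumptions: `blocking_rules` is a hypothetical helper, it only inspects rules that carry an explicit `-s` match, and it does not follow jumps into other chains (such as one created by sshguard). With CG-NAT, the address to test is the public one the instance actually sees, not the address assigned on the home LAN.

```python
import ipaddress

# Rule targets treated as "blocking" in this sketch.
BLOCKING_TARGETS = {"DROP", "REJECT"}


def blocking_rules(rules_text: str, client_ip: str):
    """Return rule lines from `iptables -S` output that would drop or reject client_ip.

    Only rules with an explicit -s (source) match are considered.
    """
    client = ipaddress.ip_address(client_ip)
    hits = []
    for line in rules_text.splitlines():
        tokens = line.split()
        if "-s" not in tokens or "-j" not in tokens:
            continue
        s = tokens.index("-s")
        j = tokens.index("-j")
        if s + 1 >= len(tokens) or j + 1 >= len(tokens):
            continue
        if tokens[j + 1] not in BLOCKING_TARGETS:
            continue
        try:
            network = ipaddress.ip_network(tokens[s + 1], strict=False)
        except ValueError:
            # Source was not a plain address or CIDR block; skip it.
            continue
        negated = s > 0 and tokens[s - 1] == "!"
        covered = client in network
        # A negated match ("! -s ...") applies to every address outside the network.
        if covered != negated:
            hits.append(line)
    return hits
```

An empty result does not prove the host firewall is innocent, since rules without a source match and rules in jumped-to chains are not examined here.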

Any help or further troubleshooting steps are greatly appreciated! Very frustrating...

Thanks!!!

siegfredv

Check the instance's serial console logs, or try moving your VM instance to a larger machine type with more RAM.

If upgrading the machine type or increasing RAM still does not work, it would be best to be in touch with a Technical Support Representative so they can look into your VM Instance.

444b

Hi Wgenik
What an interesting question!
Have you run a packet capture (pcap) on the connection and logged the results when it breaks?
Also, what does the Cloud instance itself show in terms of errors and stderr output?
Lastly, do you have access to the Cloud environment and can look through the Cloud Logging? This may be able to provide deeper insight into the connections and errors made by the VM

wgenik

So strangely enough, not long after the first reply on Tuesday, the instance started functioning normally again with no interruptions. So either something changed in my Starlink route, or internally on the instance itself. Either way, it's no longer dropping my connections and has been rock solid ever since. If the problem resurfaces, then I will restart investigations.

Thanks all.
