
Experiencing Micro-Outages Over VPN Tunnel: Any Suggestions?

We are experiencing network issues between one of our instances and a client, connected via a VPN tunnel. These issues result in 3 to 4 network micro-outages per day. During these micro-outages, our SaaS application becomes inaccessible, causing significant disruption.

How can we effectively detect and visualize such micro-outages? Has anyone else experienced similar issues, and are there best practices or tools to analyze and mitigate these problems?

Thank you for your insights.


Hey @saray_hach, hope all is well.

I think some context is missing that would help with suggestions. As I understand it, you have intermittent connectivity issues between an endpoint in GCP and on-prem over Cloud VPN built on top of Cloud Interconnect (based on the topic labels). These issues manifest as a number of short outages throughout the day. How did you narrow it down to Cloud VPN being the cause (or have you?), and what have you already looked at? Generally speaking, Cloud Monitoring and Cloud Logging would be the first things I would check. They could help you visualize and analyze things like interconnect uptime, BGP uptime, IPsec uptime, and specific error messages in the logs at the time of the issue. Did you look at these, and what were your findings?
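To get exact timestamps that can be lined up against those metrics and logs, a small probe run from the GCP instance may help. The following is only a minimal sketch, not a Google-provided tool: it assumes the client side exposes some TCP port reachable through the tunnel, and the function names `probe`, `outage_windows`, and `run` are made up for illustration.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    start is the first failed sample; end is the first successful sample
    after it, or the last failed sample if the data ends mid-outage.
    """
    windows = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, last_fail))
    return windows


def run(host: str, port: int, interval: float = 1.0) -> None:
    """Probe forever and print a UTC timestamp on each state change."""
    was_ok = True
    while True:
        ok = probe(host, port)
        if ok != was_ok:
            now = datetime.now(timezone.utc).isoformat()
            print(f"{now} {'RECOVERED' if ok else 'OUTAGE'} {host}:{port}")
            was_ok = ok
        time.sleep(interval)
```

Calling something like `run("10.0.0.5", 443)` (a placeholder address and port) prints one line per state change. A one-second interval will miss outages shorter than that, so the interval and timeout are worth tuning to how short the outages actually are.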

Hello,

Thanks for pointing this out!

You’re right; I haven’t entirely ruled out other potential causes beyond the Cloud VPN itself. The reason we suspect the VPN is that the micro-outages seem to align with tunnel renegotiation events. However, I agree this could also be influenced by interconnect or on-prem issues.

To your point, I’ve looked at Cloud Monitoring, but I’ll revisit it and check the metrics you suggested.

I haven’t yet reviewed Cloud Logging for error messages specifically at the times of these outages, so I’ll focus on that next. Do you have any specific filters or tips for identifying relevant entries in the logs for these scenarios?

Thank you for your help

Here you can find a number of Cloud Logging filters that could help with troubleshooting Cloud VPN specifically, and here are some common issues generally found with the product. It is difficult to narrow down what to look at without doing some elimination first. First, see whether the interconnect stays up; if you have access, also look at the on-prem logs to see if they give you more ideas about what could be happening. Check the lifetimes for both phases (these are a common cause of excessive rekeying). Look at who initiates the rekeying, and try to figure out whether it was supposed to happen (did the phase reach its lifetime?). But as I said before, for any of this to become relevant, you first have to establish whether the rekeying is a cause or a symptom.
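Once rekey timestamps have been pulled out of the logs by hand, the two checks above (did the phase reach its lifetime, and do the outages line up with rekeys) can be roughed out as below. This is a sketch under assumptions: it does not parse any real log format, the helper names and the 20% tolerance are illustrative, and the tolerance exists only because implementations commonly rekey a little before the configured lifetime expires.

```python
from datetime import datetime, timedelta


def early_rekeys(rekey_times, lifetime, tolerance=0.2):
    """Return (time, gap) pairs for rekeys that came earlier than expected.

    A rekey is flagged when the gap since the previous rekey is shorter
    than lifetime * (1 - tolerance). The tolerance is an assumed margin.
    """
    ordered = sorted(rekey_times)
    threshold = lifetime * (1 - tolerance)
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap < threshold:
            flagged.append((cur, gap))
    return flagged


def outages_near_rekeys(outage_starts, rekey_times,
                        window=timedelta(seconds=30)):
    """Return outage start times that fall within `window` of any rekey."""
    return [o for o in outage_starts
            if any(abs(o - r) <= window for r in rekey_times)]


# Made-up timestamps purely for illustration.
t0 = datetime(2024, 1, 1, 0, 0, 0)
rekeys = [t0, t0 + timedelta(hours=1), t0 + timedelta(hours=1, minutes=10)]
print(early_rekeys(rekeys, lifetime=timedelta(hours=1)))
```

Outages that coincide with rekeys only show correlation. If the early rekeys follow an interconnect or BGP drop in the logs, the rekeying is more likely a symptom than the cause.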