Solved: Re: Compute Engine VM randomly loses connectivity

bradgravesen · 07-21-2023 02:24 AM

I have a Compute Engine VM (Debian 11, Apache 2.4, PHP 8.2, MariaDB 11) hosting about 40 websites. About three weeks ago, the server became unreachable for no apparent reason. No response on HTTP, FTP, or SSH. Pinging any of the domains on the server returned nothing. Total loss of connectivity. I found I could temporarily fix it by resetting the VM with the reset button in the Cloud console, but it then continued to drop about once an hour. As a stopgap, I set up an hourly cron job to reboot the server, which at least minimized the downtime for my hosted websites. I have gone through the system logs and fixed anything I thought might be even remotely related, but it keeps dropping connectivity every hour, give or take. Have any of you seen anything like this before? I am at a total loss - stumped! I'll be grateful for any help I can get on this one.

Willbin

Hello @bradgravesen,

Welcome to Google Cloud Community!

It is best to always check the logs to see what is really happening on your VMs. Even when two situations look similar, the root causes may be different. If you post the error messages here, the community can help you more easily.

Thank you!

bradgravesen

So I turned off my hourly reboot cron job and waited for an hour, watching the logs stream in as the time approached. The errors started piling up immediately before the one-hour mark, and it became clear that all of them had to do with an internal network failure between my server and Google's metadata server. Around the same time I saw this failure happen live, I also noticed a strange thing in the logs at boot time: for the first couple of seconds after booting, the log entries were stamped in UTC (Coordinated Universal Time, effectively the same as GMT) rather than local time (Pacific). This meant that the first 100 or so log entries were 7 hours ahead of local time, that is, in the future.
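For anyone who wants to test the same path from inside a VM, here is a minimal sketch of a metadata server probe. It assumes the standard Compute Engine metadata endpoint and its required `Metadata-Flavor: Google` header; the function names and the choice of the `instance/id` key are just for illustration and are not something taken from my logs.

```python
import urllib.error
import urllib.request

# Standard Compute Engine metadata endpoint; instance/id is an arbitrary small key.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/id"


def build_metadata_request(url: str = METADATA_URL) -> urllib.request.Request:
    """Build a request carrying the header the metadata server requires."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def metadata_reachable(timeout: float = 2.0) -> bool:
    """Return True if the metadata server answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(build_metadata_request(), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # DNS failure, timeout, or "network unreachable" all count as a failed probe.
        return False
```

Calling `metadata_reachable()` from cron every minute and logging the result would show roughly when the connection to the metadata server starts failing relative to boot.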

And here's where it all came together!

The events that were logged in UTC appeared to the system to be in the future once the system clock adjusted to local time a couple of seconds after booting. And it was those events being logged "in the future" that caused the problem. When the system went to log additional events within the same subsystems that had logged the events in UTC, it couldn't, and it threw an error that essentially said "failed to log this event, because it's not possible to log this event in the future." The system logs events in the order they happen, each with a timestamp, so the next event would have to be written AFTER an event that happened 7 hours "in the future", which of course is impossible. This didn't cause a fatal error right away, however. The failure came an hour later, when my server made a critical network-related call to Google's metadata server, a call that happens an hour after booting. As far as I can tell, that is where the apparent time paradox broke the network protocol in a way that started producing network-unreachable messages and broke ALL network communication to and from the server.
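To make the 7-hour jump concrete, here is a small sketch of the arithmetic. It assumes Pacific Daylight Time (UTC-7), which is what applies in July, and the timestamp value is made up for illustration; it only shows the offset, not the logging failure itself.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Daylight Time (UTC-7), matching the 7-hour gap described above.
PDT = timezone(timedelta(hours=-7), "PDT")


def apparent_skew(utc_wall_clock: datetime, local_tz: timezone) -> timedelta:
    """How far in the future a UTC-stamped entry appears when its
    wall-clock value is read as if it were local time."""
    true_instant = utc_wall_clock.replace(tzinfo=timezone.utc)
    misread_instant = utc_wall_clock.replace(tzinfo=local_tz)
    return misread_instant - true_instant


# Hypothetical early-boot log timestamp, written in UTC without a zone marker.
stamp = datetime(2023, 7, 21, 9, 24)
print(apparent_skew(stamp, PDT))  # prints 7:00:00
```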

Once I knew the cause, the solution was clear. I determined that the problem was being caused by the server's hardware clock being set to sync with local time rather than staying on UTC. It's counterintuitive, but if you leave the hardware clock on UTC and set the system's time zone to local time, the system understands this, translates the hardware clock's time to local time for the system clock, and logs all entries correctly in local time.
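If you want to check which mode your own hardware clock is in, one option is to read `/etc/adjtime`, whose third line is either `UTC` or `LOCAL` (when the file is missing, `hwclock` assumes UTC). The sketch below is only a checker with a hypothetical function name; it does not change anything. On a systemd-based Debian install, `timedatectl set-local-rtc 0` is the usual way to put the hardware clock back on UTC, though that is an assumption about the setup rather than a record of the exact command used here.

```python
from pathlib import Path


def rtc_mode(adjtime_text: str) -> str:
    """Return "LOCAL" if the adjtime contents mark the hardware clock as
    local time, otherwise "UTC" (also the default when the file is empty)."""
    lines = adjtime_text.splitlines()
    if len(lines) >= 3 and lines[2].strip().upper() == "LOCAL":
        return "LOCAL"
    return "UTC"


if __name__ == "__main__":
    path = Path("/etc/adjtime")
    text = path.read_text() if path.exists() else ""
    print(f"Hardware clock mode: {rtc_mode(text)}")
```

A result of `LOCAL` on a cloud VM is worth a second look, since that is the configuration that caused the trouble described in this thread.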

I made this change and watched and waited. The system passed the one-hour mark error-free, and it has now been about 3 days since the most recent boot without a problem. So that's fantastic news! This is the solution and I will mark it as solved.