Solved: Compute Engine VM has lost its storage suddenly

gavinharriss · 03-24-2023 01:03 AM

My Compute Engine VM hosting a container image has suddenly stopped working, even though it still appears to be running fine. It has not been working for about 24 hours now, despite everything I have tried. It had been stable since I provisioned it a few years back, apart from an occasional restart when the hosted app needed to be turned off and on again.

I've restarted the VM multiple times and even stopped and started it, which resulted in new public IP addresses being assigned. Nothing has managed to get it working again.

I can ping the public IP but am unable to reach the hosted application on port 8080, which has been enabled in the firewall. With firewall logging turned on, I can see that my allow-http-8080 rule is still being used without issue.

My container application is not stateless, so it requires storage. The VM instance size is e2-medium, so it should have 10 GB of disk. It's deployed to us-west1-b.

Looking through the logs I see a few worrying messages that make me suspect storage is an issue:

Warning: Failed to run "google_optimize_local_ssd" script: exec: "google_optimize_local_ssd": executable file not found in $PATH

Warning: Failed to run "google_set_multiqueue" script: exec: "google_set_multiqueue": executable file not found in $PATH

If I try to connect to the VM using SSH, it fails to connect. If I then troubleshoot the connection, it stalls indefinitely at the "User permissions" stage of troubleshooting.

In the logs I then see the following error:

Error: Error creating user: useradd: failed to reset the lastlog entry of UID 20162: No space left on device useradd: cannot create directory /home/mail.

So it sounds like the VM is no longer able to access the storage it needs.

Nothing has been changed on my end; the VM simply stopped working correctly about 24 hours ago.

Any suggestions about how I might be able to resolve this issue?

kolban

When you run a Compute Engine instance (a VM), it runs an OS (normally Linux or Windows). When you ask it to run a Docker image, the OS running in the VM is an instance of Linux that ALSO has Docker installed and configured. We can read about this OS at:

https://cloud.google.com/container-optimized-os/docs

The net of this is that if you use the same Compute Engine instance over and over again (which is fine), that instance of the OS might need to have its logs cleaned. When you stop and restart the VM, the Docker image that runs inside the VM is pulled from Artifact Registry, but the instance of the OS on the VM remains between stops and starts. Let's be clear that it is only my guess that logs filled up the filesystem. We will learn more when you start the VM with serial console login enabled. Once we have a shell prompt and run "df", my hope is that we will find a full disk that can be cleaned ... but the problem MAY be elsewhere. We will also want to find out the "distribution" of files on the local VM filesystem, since we may find that something other than logs is being written ... to be continued 🙂
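
To illustrate the two checks described above (how full the disk is, and where the space has gone), here is a minimal Python sketch. It is not something from this thread: it assumes a Python 3 interpreter is available wherever you run it, which may not be the case on the host OS itself, and the function names and the "/var" starting point are hypothetical choices for illustration. The sizes are apparent file sizes, so treat them as a rough guide next to what "df" and "du" report.

```python
import os
import shutil


def disk_summary(path="/"):
    """Return (total, used, free) in bytes for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


def dir_sizes(root):
    """Map each immediate subdirectory of root to the summed size of files under it.

    Uses apparent file sizes (st_size), skips unreadable files, and does not
    follow symlinks, so the numbers are an estimate rather than exact usage.
    """
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is not readable; ignore it
            sizes[entry.path] = total
    return sizes


def largest(sizes, n=10):
    """Return the n (path, bytes) pairs with the largest sizes, biggest first."""
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:n]


def report(fs_path="/", root="/var", n=10):
    """Build a short text report: overall usage, then the largest directories."""
    total, used, free = disk_summary(fs_path)
    lines = [f"{fs_path}: {used / total:.0%} used, {free / 2**20:.0f} MiB free"]
    for path, size in largest(dir_sizes(root), n):
        lines.append(f"{size / 2**20:10.1f} MiB  {path}")
    return lines
```

Running print("\n".join(report("/", "/var"))) from a shell prompt would show the overall usage followed by the largest directories under /var. If logs are not the culprit, pointing root at another directory narrows down what is actually being written.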
