We are getting rid of project-wide SSH keys and enabling OS Login.
For some reason, doing so causes the instance to have high disk IO, to the point where the system is inoperable. This has happened on 2 out of the 14 systems converted so far: one on Ubuntu 20.04, the other on Debian 11. The systems where it was fine were on Debian 11 and Ubuntu 22.04.
Even after shutting the instance down and restarting it, the system is inoperable again within a few seconds, and Observability shows high disk IO.
Any way to fix this, or any idea why it's happening?
More information:
Even after backing out the OS Login change, the system is still inoperable.
Had to shut the instance down, remove the boot disk, and create a new boot disk from a recent snapshot.
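For anyone who has to do the same recovery, here is a rough sketch of the boot-disk swap with gcloud. All names, the zone, and the snapshot are placeholders, not the actual resources from this incident:

```shell
# Placeholders throughout: my-vm, my-vm-boot, my-snap, us-central1-a
gcloud compute instances stop my-vm --zone=us-central1-a
gcloud compute instances detach-disk my-vm --disk=my-vm-boot --zone=us-central1-a
# Create a replacement boot disk from a recent snapshot
gcloud compute disks create my-vm-boot-restored \
    --source-snapshot=my-snap --zone=us-central1-a
gcloud compute instances attach-disk my-vm --disk=my-vm-boot-restored \
    --boot --zone=us-central1-a
gcloud compute instances start my-vm --zone=us-central1-a
```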
Sure would be nice if a Google engineer would partake in these forums...
There are a few possible reasons why enabling OS Login could be causing high disk IO on your instances.
One possibility is that the OS Login daemon is constantly trying to authenticate users, even when there are no users trying to log in. This can happen if the daemon is configured incorrectly or if there is a problem with the authentication service.
Another possibility is that the OS Login daemon is creating a large number of temporary files. This can happen if there are a lot of users trying to log in at the same time or if there is a problem with the file system.
To troubleshoot this issue, here are some steps you can take:
- Use the strace command to trace the system calls made by the OS Login daemon. This can help you identify the source of the high disk IO.
- Use the iotop command to monitor disk usage on the instance. This can help you identify which processes are causing the high disk IO.
Once you have identified the cause of the high disk IO, you can take steps to resolve it. For example, if the OS Login daemon is constantly trying to authenticate users, you can disable the daemon or configure it to authenticate users only when necessary. If the daemon is creating a large number of temporary files, you can increase the available disk space or configure it to delete temporary files after a certain period.
If you are still unable to resolve the issue, you can contact Google Cloud support for assistance.
Thanks for that detailed reply, I appreciate it.
As far as "OS Login daemon configuration", there isn't any?
I followed this guide: https://cloud.google.com/compute/docs/oslogin/set-up-oslogin
Installing on a per instance level, so the only configuration is setting metadata:
enable-oslogin TRUE
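For reference, that per-instance metadata can be set from the CLI as well as the console; the instance name and zone below are placeholders:

```shell
# Enable OS Login on a single instance (placeholders: INSTANCE_NAME, ZONE)
gcloud compute instances add-metadata INSTANCE_NAME \
    --zone=ZONE \
    --metadata=enable-oslogin=TRUE
```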
I have enabled this on 28 out of the 39 instances we have. The issue I posted has happened on only 4. During that time, the system is inoperable, so I cannot run any commands at all - I can only watch the load via the GCP console.
Editing the instance and removing that configuration doesn't fix it.
For 1 instance, which is very critical, we used a snapshot and restored the system to just prior to the change. For the other 3, we waited: one took 4 hours, the other 2 took 5 hours, and then suddenly the systems were fine.
It's almost as if someone had hot-swapped a mirror disk and the mirroring put a strain on the primary disk.
We are a small organization with only 4 of us that actually login with shell.
Systems are mostly Debian 11 and a few Ubuntu 22.04.
Hate to use the cliche, but I have been a SysAdmin for over 30 years and I have never seen something like this before. 🙂
But I do want to understand it, that is, when you enable oslogin, what happens to an instance?
Does it read all the inodes for some reason?
Why did only 4 systems (so far) get hit with this?
Sigh, signed into a system today with OS Login enabled - and the CPU spiked and the disk was thrashed with IO....
Update - issue solved.
We had another outage today on our VPN server - which was a blessing in disguise.
My colleague had noticed that Promtail wasn't running on it, so he started it up.
Within 2 minutes, the server was hosed. The issue is that OS Login makes use of /var/log/lastlog, and we had Promtail configured to scrape all logs in the /var/log directory.
We changed Promtail's config to only look at:
/var/log/*.log and /var/log/syslog
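For reference, a sketch of what the narrowed Promtail scrape config might look like. The job names and labels are illustrative; the key part is the `__path__` glob in a standard static_configs block, which no longer matches /var/log/lastlog:

```yaml
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          # Only regular *.log files; excludes sparse files like /var/log/lastlog
          __path__: /var/log/*.log
      - targets: [localhost]
        labels:
          job: syslog
          __path__: /var/log/syslog
```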
As you know, lastlog is a "sparse" file, but it was causing Promtail to lose its mind: its apparent size can grow huge without actually taking up disk space.
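A quick way to see what "sparse" means here, using a throwaway file rather than lastlog itself (the path is just an example):

```shell
# A sparse file has a huge apparent size but near-zero blocks on disk.
truncate -s 1G /tmp/sparse_demo           # 1 GiB apparent size, no data written
stat -c 'apparent: %s bytes' /tmp/sparse_demo
du -k /tmp/sparse_demo                    # actual disk usage in KiB (near 0)
rm /tmp/sparse_demo
```

A log scraper that reads by apparent size ends up churning through gigabytes of zeroes, which matches the IO thrashing we saw.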
Rgds...Geoff