We are running several VMs through Vertex AI Workbench.
Each VM experiences this issue once every 2-3 days. We only run JupyterLab notebooks on it plus a simple Python-based training script launched via "screen" [screen python3 train.py], so it keeps running without being connected to the Jupyter notebook. There are no other processes running on the VM, and the training script reads from a relatively small (50 MB) data file and starts a new epoch every 10 minutes or so.
The VM works as expected, and the training script trains the model successfully. However, every once in a while (once every 2-3 days), the machine does the following:
1) Very high READ disk usage (30 MiB/s) for about 20 minutes, which doesn't make sense given the running processes, since normal read disk usage during training is < 0.05 MiB/s. During these 20 minutes, WRITE disk usage stays normal.
2) Very high READ IOPS (> 800/s) during the same 20-minute period; normal read IOPS during training are < 0.05/s. During these 20 minutes, WRITE IOPS stay normal.
3) The [screen python3 train.py] process is killed and CPU usage drops from 50% to zero (which makes sense: once the screen process is terminated, the VM has nothing left to do).
4) The machine becomes unresponsive: we cannot log in via SSH or connect to the JupyterLab notebook.
5) After the high disk usage ends, the machine returns to normal and we can connect via SSH and/or JupyterLab again. However, since the screen [python3 train.py] process was terminated, it has to be restarted manually.
These are pretty standard Google Compute Engine VMs, but with the following specific setup:
At the following link you can read about a way to troubleshoot what is happening in your case and what you can do about it. This might also be happening because, as you mention, Python may be using a lot of memory while it is running.
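If Python really is exhausting the VM's memory, that would also explain why the [screen python3 train.py] process gets killed: when RAM runs out, the kernel's OOM killer typically terminates the process using the most memory. A simple way to check is to log the training process's memory use at the end of every epoch and see whether it climbs toward the machine's total RAM before each crash. Below is a minimal sketch, not your actual train.py; it assumes psutil is installed (pip install psutil), and run_one_epoch() is a hypothetical stand-in for your real training step.

    import os
    import psutil

    # Handle to the current (training) process.
    proc = psutil.Process(os.getpid())

    def run_one_epoch():
        # Placeholder: replace with the body of your real training loop.
        pass

    def log_memory(epoch):
        # Resident set size of this process, in MiB.
        rss_mib = proc.memory_info().rss / (1024 ** 2)
        print(f"epoch {epoch}: resident memory {rss_mib:.1f} MiB", flush=True)

    for epoch in range(100):  # replace 100 with your real epoch count
        run_one_epoch()
        log_memory(epoch)

If the logged value grows steadily until the process dies, the next steps would be to reduce the script's memory footprint or resize the instance. Once you regain SSH access after an incident, the kernel log (dmesg or journalctl -k) should also show whether an out-of-memory kill happened around that time.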