
High Disk Read + IOPS in Vertex AI VM - Machine Unresponsive, Then Recovers

We are running several VMs through Vertex AI Workbench.

Each VM experiences this issue once every 2-3 days. We are only running JupyterLab notebooks on it, plus a simple Python-based training script started via "screen" [screen python3 train.py] so that it keeps running without a connection to the Jupyter notebook. There are no other processes running on the VM. The training script reads from a relatively small (50MB) data file and starts a new epoch every 10 minutes or so.

The VM works as expected and the training script trains the model successfully. However, once every 2-3 days, the machine does the following:

1) Very high READ Disk Usage (30MiB/s) for about 20 minutes, which doesn't make sense given the running processes: normal Read Disk usage during training is < 0.05/s. During these 20 minutes, WRITE Disk Usage is normal.

2) Very high Read IOPS (>800/s) during the same 20-minute period. Normal Read IOPS during training are < 0.05/s. During these 20 minutes, WRITE IOPS are normal.

3) The [screen python3 train.py] process is killed and CPU usage drops from 50% to zero (which makes sense: once the screen process is terminated, the VM is not doing anything anymore).

4) The machine becomes unresponsive. We cannot log in via SSH and cannot connect to the JupyterLab notebook.

5) After the high Disk usage is finished, the machine returns to normal and we can connect via SSH and/or JupyterLab. However, since the screen [python3 train.py] process was terminated, it has to be restarted manually. 

These are pretty standard Google Compute VMs, but with the following specific setup:

Environment: TensorFlow Enterprise 2.9 (Intel® MKL-DNN/MKL)
Environment version: M94
Machine type: n1-standard-4 (4 vCPUs, 15 GB RAM)
GPU: NVIDIA Tesla T4 x 1
Boot disk: 100 GB disk
Data disk: 100 GB disk
 
Troubleshooting so far: 
 
1) Reviewed kern.log. It shows python3 as the process involved: it appears to cause an out-of-memory (OOM) condition and is then automatically killed.
 
2) We installed the log agent to track RAM usage (presumably it will be high at the time of the incident, since kern.log shows an OOM) and will update here.
 
3) I had a somewhat similar issue a while back with Azure Linux VMs. It was due to clamav (an antivirus package installed by default on Azure Linux VMs) running an update every so often. After disabling clamav completely, the issue did not recur. However, there is no clamav on Vertex AI VMs, so this cannot be the cause here.
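For item 2, a local logger can complement the agent, since a file on disk survives the period when the VM is unreachable. The following is only a minimal sketch: it assumes a Linux VM where /proc/meminfo is readable, and the function names, the log path `mem.log`, and the 10-second interval are hypothetical choices, not part of the original setup.

```python
import time
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(info, now=None):
    """Build one log line with a timestamp and the fields relevant to an OOM."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    fields = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")
    return stamp + " " + " ".join(f"{key}={info.get(key, 0)}kB" for key in fields)


def log_samples(path="mem.log", interval_s=10, count=6):
    """Append `count` samples to `path`, one every `interval_s` seconds."""
    for _ in range(count):
        info = parse_meminfo(Path("/proc/meminfo").read_text())
        with open(path, "a") as handle:
            handle.write(format_sample(info) + "\n")
        time.sleep(interval_s)
```

Calling `log_samples(count=...)` with a large count from a second screen session would leave a trail showing whether MemAvailable collapses in the minutes before python3 is killed.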
 
Question
Has anyone experienced this? How can we find out why python3 intermittently causes this OOM/disk usage issue? Is the high disk usage just due to the system trying to use a pagefile (swap) because it is out of memory? Is there a way to see what is being read during the period of high disk usage?
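On the last question, one possible approach is to compare per-process I/O counters over time. On Linux, /proc/<pid>/io exposes a `read_bytes` counter for each process; reading it for processes owned by other users typically requires root. The sketch below is an assumption-laden illustration, not a confirmed diagnosis method for this VM: the function names are hypothetical, and it only shows which process is responsible for the reads, not which files are being read.

```python
from pathlib import Path


def parse_proc_io(text):
    """Parse /proc/<pid>/io-style text into {counter: int}."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def snapshot_read_bytes(proc_root="/proc"):
    """Return {pid: (command name, read_bytes)} for every readable process."""
    result = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            io_text = (entry / "io").read_text()
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # process exited, or we lack permission to read it
        result[int(entry.name)] = (comm, parse_proc_io(io_text).get("read_bytes", 0))
    return result


def top_readers(before, after, limit=5):
    """Rank processes by bytes read from storage between two snapshots."""
    deltas = []
    for pid, (comm, read_after) in after.items():
        read_before = before.get(pid, (comm, 0))[1]
        delta = read_after - read_before
        if delta > 0:
            deltas.append((delta, pid, comm))
    return sorted(deltas, reverse=True)[:limit]
```

Because the machine is unreachable during the incident, this would need to run in a loop (snapshot, sleep, snapshot, append `top_readers(...)` to a file) started beforehand, so the results can be read once the VM recovers.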
 

The next link describes a way to troubleshoot what is happening in your case. As you mention, this might be happening because Python is using a lot of memory while it's running.
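To check whether the training process itself grows in memory, its resident set size (RSS) can be recorded once per epoch. This is a minimal sketch assuming a Linux VM, where /proc/self/status reports a `VmRSS` line in kB; the helper names and the log path `train_mem.log` are hypothetical, and the call would have to be added to train.py by hand.

```python
from pathlib import Path


def parse_vm_rss_kb(status_text):
    """Extract the VmRSS value (in kB) from /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # field not present (e.g. kernel threads)


def current_rss_kb():
    """Resident memory of the calling process, in kB (Linux only)."""
    return parse_vm_rss_kb(Path("/proc/self/status").read_text())


def log_epoch_memory(epoch, path="train_mem.log"):
    """Append one line per epoch so memory growth over days becomes visible."""
    with open(path, "a") as handle:
        handle.write(f"epoch={epoch} rss_kb={current_rss_kb()}\n")
```

Calling `log_epoch_memory(epoch)` at the end of each epoch would show whether RSS climbs steadily toward the 15 GB of RAM (suggesting a leak in the script) or stays flat until a sudden spike.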