I am experiencing an issue with a GPU VM instance launched via the Google Cloud console web UI.
I launched an "a2-ultragpu-1g" instance using the "Deep Learning VM for PyTorch 2.3 with CUDA 12.1 M125" machine image in "us-central1-a". This previously worked as expected. However, now I get a "Segmentation fault" when trying to start python:
```bash
(base) my-user-name@my-instance-name:~$ python
Python 3.10.15 | packaged by conda-forge | (main, Sep 20 2024, 16:37:05) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Segmentation fault
```
First, I stopped and restarted the instance, but this did not help. Second, I deleted the instance and created an identically configured one, but the problem persists. The instance uses the official "Deep Learning VM for PyTorch 2.3 with CUDA 12.1 M125" machine image, which includes the NVIDIA drivers.
Just to be clear: The "Segmentation fault" happens on a fresh VM instance, after ssh-ing into the instance for the first time, when running "python" in the command line.
Hi @52189b27d02,
Welcome to Google Cloud Community!
The "Segmentation fault" error in your Python session means that your Python program attempted to access memory it wasn't allowed to access. This error indicates a possible bug in your code or a problem with a library you're using.
Here are some suggestions, along with links to relevant documentation:
Investigate your kernel logs, which contain the most valuable information. Try running `dmesg` immediately after SSHing into the instance, before attempting to run python.
Note: Error messages may be overwritten quickly. Look for errors related to memory, the kernel, NVIDIA drivers, or anything unusual around the time of the instance boot.
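As a minimal sketch of the log check described above, the kernel log can be piped through a filter that keeps only lines which commonly accompany a crash. The helper name `filter_kernel_log` and the keyword list are assumptions made for illustration, not an official diagnostic, and `dmesg` may need `sudo` depending on the image's configuration.

```shell
# Hypothetical helper: keep only kernel log lines that mention a crash,
# the NVIDIA driver, or memory pressure. The keyword list is an assumption
# and can be extended as needed.
filter_kernel_log() {
  grep -iE 'segfault|general protection|out of memory|nvrm|nvidia'
}

# Usage (run right after logging in, before starting python):
#   sudo dmesg -T | filter_kernel_log
```

Reading from standard input keeps the helper usable on a saved log file as well, which is useful because the relevant messages may be overwritten quickly.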
Make sure all your libraries are up to date. You can use the venv tool, which creates isolated Python environments. These isolated environments can have separate versions of Python packages, which lets you isolate one project's dependencies from those of other projects.
Note: If you're using Anaconda, follow the instructions on their website.
For more information, you may refer to the Google Cloud documentation.
We encourage you to file an issue in the public issue tracker. Keep in mind that there is no set time frame for resolving it. If the segmentation fault errors persist and need further investigation, please feel free to reach out to our support team.
I hope the above information is helpful.
Again:
> Just to be clear: The "Segmentation fault" happens on a fresh VM instance, after ssh-ing into the instance for the first time, when running "python" in the command line.
In other words: I am not running custom python code. I launched an "a2-ultragpu-1g" instance using an official machine image. There is no custom code or custom configuration. It is a fresh VM running an official machine image. After launching the VM, I type "python" on the bash command line, press enter, and there is a segmentation fault.
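The reproduction above can also be checked non-interactively, which makes it easier to compare images or zones. This is a rough sketch: the helper name `check_segv` is hypothetical, and it relies on the shell convention that a process killed by a signal reports exit status 128 plus the signal number (139 for SIGSEGV on Linux). Since the crash reported here appears after the interactive banner, a non-interactive run such as `python -c 'pass'` may not trigger it.

```shell
# Hypothetical helper: run a command quietly and classify how it ended.
# Exit status 139 = 128 + 11, i.e. the process was killed by SIGSEGV on Linux.
check_segv() {
  if "$@" >/dev/null 2>&1; then
    status=0
  else
    status=$?
  fi
  case "$status" in
    0)   echo "ok" ;;
    139) echo "segfault" ;;
    *)   echo "exit $status" ;;
  esac
}

# Usage on the affected instance:
#   check_segv python -c 'pass'
```

If the non-interactive run prints "ok" while the interactive prompt still crashes, that difference is itself worth including in a bug report.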
I tried this with several machine images:
- Deep Learning VM for PyTorch 2.4 with CUDA 12.4 M126
Debian 11, Python 3.10, with PyTorch 2.4 and fast.ai preinstalled.
- Deep Learning VM for PyTorch 2.3 with CUDA 12.1 M125
Debian 11, Python 3.10, with PyTorch 2.3 and fast.ai preinstalled.
- Deep Learning VM with CUDA 12.4 M126
Debian 11, Python 3.10. With CUDA 12.4 preinstalled.
And I tried in these regions:
- us-central1-a
- us-east4-c
I tried to file a technical support case, but I did not have the permission.
For future reference / in case anyone else encounters this problem:
> Status: Won't Fix (Infeasible)
https://issuetracker.google.com/issues/380416868
So Google Cloud Support has decided not to fix this issue because it is "Infeasible".
It gets even weirder. This problem only occurs when using an existing ssh key. In other words, if I add my own, existing ssh key when launching the VM from the Google Cloud console web UI, I get a "Segmentation fault" when trying to use python. In contrast, if I do not add my own ssh key and instead create a new one (after launching the instance) with the gcloud CLI, there is no problem.
This is rather weird. The problem is not with the login: I can ssh into the instance with my own key.
On other cloud providers I have routinely used existing ssh keys to log in to VM instances without any issues.