Hi. I have a couple of Compute Engine instances that make use of a GPU and were provisioned with the Deep Learning ML images. They work fine most of the time.
But sometimes, after a restart, the NVIDIA drivers won't load, and I have to reinstall them manually following the usual instructions. Reinstalling always fixes the problem. The challenge is that I cannot automate the start-stop of the machine, because every boot is a lottery: I never know when it is going to fail.
Thanks.
Hi @ManuelMeterian,
Welcome to Google Cloud Community!
Here are some guides for troubleshooting NVIDIA driver issues like yours:
You may also check this document for best practices in building reliable systems on Compute Engine. It offers general tips and explains features that can help reduce downtime and handle unexpected VM failures.
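Since reinstalling always fixes the problem for you, one possible way to make the start-stop automation safe is to run a check at every boot and reinstall only when the driver is not loaded. Below is a minimal sketch in Python, not a tested solution for your images. It assumes that `nvidia-smi` exits with a non-zero status (or cannot be run) when the driver failed to load. `REINSTALL_CMD` is a hypothetical placeholder: point it at a script that contains the reinstall steps you already run by hand. The function names are made up for illustration.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Assumption: nvidia-smi exits non-zero (or is missing) when the driver is not loaded.
CHECK_CMD = ["nvidia-smi"]
# Hypothetical placeholder: replace with a script holding your manual reinstall steps.
REINSTALL_CMD = ["/path/to/your/reinstall-driver.sh"]


def command_succeeds(cmd: Sequence[str], timeout_s: float) -> bool:
    """Return True only if cmd starts, finishes in time, and exits with status 0."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or a hung command all count as failure.
        return False
    return result.returncode == 0


def ensure_driver(
    check: Callable[[], bool],
    reinstall: Callable[[], object],
    max_attempts: int = 2,
) -> bool:
    """Return True if check() passes, reinstalling up to max_attempts times if needed."""
    if check():
        return True
    for _ in range(max_attempts):
        reinstall()
        if check():
            return True
    return False


def main() -> int:
    """Run the check-and-repair loop; return 0 if the driver is usable, else 1."""
    ok = ensure_driver(
        check=lambda: command_succeeds(CHECK_CMD, timeout_s=60),
        reinstall=lambda: command_succeeds(REINSTALL_CMD, timeout_s=1800),
    )
    if not ok:
        print("NVIDIA driver still not loaded after reinstall attempts", file=sys.stderr)
    return 0 if ok else 1

# To use this as a script, add at the end of the file: raise SystemExit(main())
```

You could have the VM's startup script (the `startup-script` metadata key) invoke this, so that your workload starts only after it exits with status 0. A non-zero exit tells your automation that the VM needs attention instead of leaving it to fail later.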
I hope the above information is helpful.