Hi @vulture903,
Welcome to Google Cloud Community!
You may consider using reservations to guarantee capacity for your Compute Engine VMs. This helps ensure you'll have resources available even when demand increases.
You can explore GPU Documentation that provides an overview of GPUs in GCP, the features, and how to use them. Also, you can check GPU Instance Types where there are detailed specs and configuration options for different GPU instance types.
If the issue still persists and needs further assistance, please feel free to reach out to our support team.
I hope the above information is helpful.
Hi @reinc,
Thank you for your response and for the suggestions. I appreciate the information on using reservations and the documentation links provided.
Unfortunately, I have already attempted several approaches, including leveraging the reservation feature, but encountered limitations due to the lack of available resources in the required zones. Even when I tried creating future reservations, I was met with restrictions stating that I am not currently eligible to use this feature based on my account's usage history.
Additionally, I recently requested an increase in my gpus_all_regions quota and received approval for 4 GPUs. Despite this, I still face issues with resource availability when attempting to deploy the required N1-standard-8 instance with T4 GPUs in various regions and zones.
I've also thoroughly explored different regions and GPU configurations, but the unavailability of resources remains a persistent issue. Given the urgency, any additional guidance or support would be invaluable, especially in identifying regions or zones where I could deploy the instance successfully without further obstacles.
Would it be possible to escalate this issue or perhaps receive more specific recommendations regarding zones with greater availability for N1-standard-8 instances with T4 GPUs? Alternatively, if there are any other solutions or configurations I can explore to meet my needs, I'd appreciate your insight.
Thanks again for your assistance, and I look forward to your advice.
Just want to add more pressure to this post. @vulture903, you're describing our problem exactly. We started a new project a year ago and only began testing our workflows in the past month or so. We only ever try to spin up 1 NVIDIA T4, with varying success. Sometimes it works, but often we get the ZONE_RESOURCE_POOL_EXHAUSTED message. We too have tried multiple zones and increasing our CPU/memory count, but have yet to try reservations.
This does not give us confidence to pursue these GPU workflows on GCP, especially given the lack of insight into errors and availability.
FWIW, we have only ever been approved for 1 GPU under the gpus_all_regions quota, even though we have lots of database and storage activity.
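For anyone else trying the multi-zone approach mentioned above, here is a minimal sketch of a fallback loop that tries zones in order and moves on only when the failure is a capacity error. The `create_vm` callable is hypothetical (you would plug in your own gcloud or API call), `fake_create_vm` is a stand-in so the example runs on its own, and the zone names other than us-west3-b are placeholders, not a list of zones known to have T4 capacity.

```python
from typing import Callable, Iterable, Optional

# Error code reported in this thread when a zone has no capacity left.
EXHAUSTED = "ZONE_RESOURCE_POOL_EXHAUSTED"


def create_in_first_available_zone(
    zones: Iterable[str],
    create_vm: Callable[[str], object],
) -> Optional[str]:
    """Try each zone in order and return the first one where creation succeeds.

    Returns None if every zone reported a capacity error.
    """
    for zone in zones:
        try:
            create_vm(zone)  # hypothetical: your own VM creation call
            return zone
        except Exception as err:
            # Only skip capacity errors. Quota, permission, or config
            # errors will not be fixed by switching zones, so re-raise them.
            if EXHAUSTED not in str(err):
                raise
    return None


def fake_create_vm(zone: str) -> None:
    """Stand-in for a real creation call: only us-west3-b has capacity."""
    if zone != "us-west3-b":
        raise RuntimeError(f"{EXHAUSTED}: no capacity in {zone}")


if __name__ == "__main__":
    # Placeholder zone list; check the GPU regions and zones docs for T4.
    candidates = ["us-central1-a", "us-east1-c", "us-west3-b"]
    print(create_in_first_available_zone(candidates, fake_create_vm))
```

This only automates the retrying. It cannot create capacity that a zone does not have, so it may still return None.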
Hi @acspock,
Thank you for sharing your experience. It resonates deeply with what I’ve been facing. I appreciate knowing I'm not the only one grappling with this issue.
After numerous attempts and countless hours of trial and error, I managed to deploy the N1-standard-8 instance with a T4 GPU in the us-west3-b zone (Salt Lake City), which is in one of the more expensive regions. Unfortunately, my relief was short-lived. The instance ran fine for a few hours, but after I stopped it and later tried to restart it, I encountered the same ZONE_RESOURCE_POOL_EXHAUSTED error. It was as if all progress was immediately undone.
This persistent issue is beyond frustrating and genuinely concerning, especially for projects with critical timelines and substantial GPU demands. As you mentioned, having no clear insight into resource availability and experiencing unpredictable success in deploying instances undermines the confidence necessary to scale or even continue GPU-dependent workflows on GCP.
I strongly believe Google Cloud needs to address this gap with a more robust approach, ensuring users have a minimum level of infrastructure security and reliability for their projects. Reliable access to resources is not just a "nice-to-have"—it's a baseline expectation.
Thank you for your input, and I hope we can push for more transparency and solutions regarding this critical issue.
Hey @vulture903 I have a small update.
For starters, I was able to spin up an L4 and keep it alive for some time while we tested. More importantly, we ended up scheduling a sales call with GCP/Google and met with a handful of people, including an account manager, sales, and technical solutions staff. We were told that there are internal controls around GPUs and quota management.
After discussing with them our project, our goals and technical approach, we were told there are internal steps that need to be done with our account before a quota increase was initiated.
Even after seeking clarification from the Google reps, it wasn't clear whether this would resolve our `ZONE_RESOURCE_POOL_EXHAUSTED` errors.
That was a little over a week ago and we're still waiting for their "internal process" for our account before upgrading our quota.
I'll keep this thread up to date with our progress and let you know if this process unblocks us from spinning up resources at will.
Cheers.
Hey @vulture903, I have some good news. After weeks of GCP's internal process, it appears we finally have our quota increases.
It looks like we received what we asked for: eight NVIDIA T4 and sixteen NVIDIA L4 GPUs.
We also received a 300-CPU limit for the C4 generation.
The interesting thing is that now, when I create VMs, our gpus_all_regions metric no longer shows usage. Instead, usage is counted against the quota for the specific GPU type.
E.g.: NVIDIA T4 GPUs / usage: 2, limit: 8, and the same for the NVIDIA L4 GPUs metric.
The gpus_all_regions value doesn't appear to change at all now.
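To keep an eye on those per-GPU-type numbers, a small helper like the one below can compute the remaining headroom per metric. This is only a sketch: it assumes quota entries shaped like the `metric` / `limit` / `usage` records shown on the quota page, and the metric names (`NVIDIA_T4_GPUS`, `NVIDIA_L4_GPUS`, `CPUS`) are assumptions. The T4 figures come from the post above, while the L4 and CPU usage values are made up for illustration.

```python
from typing import Dict, List


def gpu_headroom(quotas: List[dict]) -> Dict[str, float]:
    """Return limit minus usage for every quota whose metric mentions GPU."""
    return {
        q["metric"]: q["limit"] - q["usage"]
        for q in quotas
        if "GPU" in q["metric"]
    }


# Illustrative data. Metric names are assumed; only the T4 usage/limit
# and the L4 limit are taken from the thread.
sample_quotas = [
    {"metric": "NVIDIA_T4_GPUS", "limit": 8.0, "usage": 2.0},
    {"metric": "NVIDIA_L4_GPUS", "limit": 16.0, "usage": 2.0},
    {"metric": "CPUS", "limit": 300.0, "usage": 8.0},
]

if __name__ == "__main__":
    for metric, remaining in gpu_headroom(sample_quotas).items():
        print(f"{metric}: {remaining:g} remaining")
```

Keep in mind that quota headroom and zone capacity are separate things: having quota left does not by itself rule out a `ZONE_RESOURCE_POOL_EXHAUSTED` error.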
So the conclusion is that GCP likely flipped some internal switches to get us the quota and access we needed. I would recommend speaking with the sales team and telling them about your issues and what you're trying to achieve, even for a small project.