Re: Unable to Create Instance with NVIDIA T4 GPU in Any Region/Zone

vulture903 · 10-10-2024 07:56 AM

Hello community,

I'm reaching out with an issue that is critically impacting my project. I need to deploy an N1-standard-8 instance with an NVIDIA T4 GPU, but after trying numerous regions and zones, I consistently encounter errors related to resource availability. I've exhausted all options that I can think of, and nothing seems to work. Below is a detailed summary of the situation:

Context:

- Instance Type: N1-standard-8 with NVIDIA T4 GPU.

- Regions and Zones Tried: I have tried deploying the instance in several zones across regions like us-east1, us-west1, us-central1, us-west2, us-east4, and many others, but without success.

Errors Encountered:

- ZONE_RESOURCE_POOL_EXHAUSTED: This error appears across almost all zones, indicating that there are no available resources for an N1-standard-8 instance with a T4 GPU.

- GPU Not Found: In some zones (e.g., us-east1-b), I receive errors stating that the NVIDIA T4 GPU is not available in that zone.
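The two errors above have different causes: "GPU Not Found" means the zone does not offer the T4 at all, while ZONE_RESOURCE_POOL_EXHAUSTED means the zone offers it but has no free capacity right now. A minimal sketch for ruling out the first case, assuming the `gcloud` CLI is installed and authenticated; the helper names are made up for illustration, and the parser assumes each JSON entry carries a `zone` field (a full resource URL or a bare name):

```python
import json
import subprocess


def accelerator_list_cmd(gpu_type="nvidia-tesla-t4"):
    """Build the gcloud command that lists where a GPU type is offered."""
    return [
        "gcloud", "compute", "accelerator-types", "list",
        f"--filter=name={gpu_type}",
        "--format=json",
    ]


def zones_from_listing(listing_json):
    """Return the sorted, de-duplicated zone names found in the listing.

    The "zone" value may be a resource URL or a bare zone name, so only
    the last path segment is kept.
    """
    entries = json.loads(listing_json)
    return sorted({entry["zone"].rsplit("/", 1)[-1] for entry in entries})


def zones_offering(gpu_type="nvidia-tesla-t4"):
    """Run gcloud and return the zones that list the given GPU type."""
    result = subprocess.run(
        accelerator_list_cmd(gpu_type),
        check=True, capture_output=True, text=True,
    )
    return zones_from_listing(result.stdout)
```

A zone appearing in this list only means the T4 is offered there; it says nothing about whether capacity is free at the moment.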

What I’ve Already Tried:

- I've tested increasing the machine size to N1-standard-16 to see if higher-capacity instances have more availability, but I encounter the same issues.

- I've checked project quotas, and everything is in order, so quotas are not the limiting factor.

- I also considered switching to other GPU types like P4 and V100, but the T4 is the most suitable for my workload due to its balance between performance and cost.

- I attempted using preemptible instances to reduce the strain on resources, but I still encountered the same resource availability errors.
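Trying zones by hand, as described above, can be scripted. The following is only a sketch, not an official tool: it walks a list of candidate zones and stops at the first one where creation succeeds. It assumes `gcloud` is installed and authenticated, that the candidate zones all offer the T4, and that the error code string ZONE_RESOURCE_POOL_EXHAUSTED shows up in gcloud's stderr. The function names and the injectable `runner` parameter are hypothetical.

```python
import subprocess

EXHAUSTED = "ZONE_RESOURCE_POOL_EXHAUSTED"


def create_cmd(name, zone, machine_type="n1-standard-8",
               gpu_type="nvidia-tesla-t4", gpu_count=1):
    """Build the gcloud command for one GPU VM in one zone."""
    return [
        "gcloud", "compute", "instances", "create", name,
        f"--zone={zone}",
        f"--machine-type={machine_type}",
        f"--accelerator=type={gpu_type},count={gpu_count}",
        # GPU VMs cannot live-migrate, so host maintenance must terminate.
        "--maintenance-policy=TERMINATE",
    ]


def run_gcloud(cmd):
    """Run a command and return (exit code, stderr text)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr


def first_zone_with_capacity(name, zones, runner=run_gcloud):
    """Try each zone in order.

    Returns (zone, exhausted_zones) where zone is None if every zone
    was exhausted. Any other failure (quota, permissions, bad flags)
    raises, because moving to another zone would only mask it.
    """
    exhausted = []
    for zone in zones:
        code, stderr = runner(create_cmd(name, zone))
        if code == 0:
            return zone, exhausted
        if EXHAUSTED in stderr:
            exhausted.append(zone)
            continue
        raise RuntimeError(f"{zone}: {stderr.strip()}")
    return None, exhausted
```

Passing a fake `runner` makes it possible to dry-run the loop without creating (and paying for) real instances.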

This situation is now critically affecting the progress of my project. The GPU resources are necessary for running computationally intensive tasks, and without these resources, I'm unable to proceed. The deadline for deployment is fast approaching, and not being able to provision the required infrastructure is causing serious concern.

I need assistance in identifying any region or zone with availability for N1 instances with NVIDIA T4 GPUs. Additionally, I would appreciate understanding why there is such widespread unavailability of resources across all the zones I’ve tried, and whether there are plans to address this issue. Any guidance or alternative solutions that would allow me to deploy this instance with the necessary GPU configuration would be appreciated.

I'm happy to provide additional details if necessary, and I would greatly appreciate any help or guidance.

Thank you in advance!

reinc

Hi @vulture903,

Welcome to Google Cloud Community!

You may consider using reservations to guarantee capacity for your Compute Engine VMs. This helps ensure you'll have resources available even when demand increases.
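As a rough illustration of what creating such a reservation might look like, here is a sketch that builds a `gcloud compute reservations create` command for one N1-standard-8 VM with one T4. The reservation name and zone are placeholders, and the helper function is hypothetical. Note that an on-demand reservation can only be created if the zone has capacity at that moment, so it may fail with the same availability error.

```python
import shlex


def reservation_cmd(name, zone, vm_count=1, machine_type="n1-standard-8",
                    gpu_type="nvidia-tesla-t4", gpu_count=1):
    """Build the gcloud command that reserves GPU VM capacity in a zone."""
    return [
        "gcloud", "compute", "reservations", "create", name,
        f"--zone={zone}",
        f"--vm-count={vm_count}",
        f"--machine-type={machine_type}",
        f"--accelerator=count={gpu_count},type={gpu_type}",
    ]


if __name__ == "__main__":
    # Print the command for review instead of running it: a reservation
    # is billed for as long as it exists, whether or not a VM uses it.
    print(shlex.join(reservation_cmd("t4-reservation", "us-central1-a")))
```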

You can explore the GPU documentation, which provides an overview of GPUs in GCP, their features, and how to use them. You can also check GPU Instance Types for detailed specs and configuration options for the different GPU instance types.

If the issue persists and you need further assistance, please feel free to reach out to our support team.

I hope the above information is helpful.

vulture903

Hi @reinc,

Thank you for your response and for the suggestions. I appreciate the information on using reservations and the documentation links provided.

Unfortunately, I have already attempted several approaches, including leveraging the reservation feature, but encountered limitations due to the lack of available resources in the required zones. Even when I tried creating future reservations, I was met with restrictions stating that I am not currently eligible to use this feature based on my account's usage history.

Additionally, I recently requested an increase in my gpus_all_regions quota and received approval for 4 GPUs. Despite this, I still face issues with resource availability when attempting to deploy the required N1-standard-8 instance with T4 GPUs in various regions and zones.

I've also thoroughly explored different regions and GPU configurations, but the unavailability of resources remains a persistent issue. Given the urgency, any additional guidance or support would be invaluable, especially in identifying regions or zones where I could deploy the instance successfully without further obstacles.

Would it be possible to escalate this issue or perhaps receive more specific recommendations regarding zones with greater availability for N1-standard-8 instances with T4 GPUs? Alternatively, if there are any other solutions or configurations I can explore to meet my needs, I'd appreciate your insight.

Thanks again for your assistance, and I look forward to your advice.

acspock

Just want to add more pressure to this post. @vulture903, you're describing our problem exactly. We started a new project a year ago and only in the past month or so began testing our workflows. We only ever try to spin up 1 NVIDIA T4, with varying success. Sometimes it works, but often we get the ZONE_RESOURCE_POOL_EXHAUSTED message. We too have tried multiple zones and increasing our CPU/memory count, but have yet to try reservations.

This does not give us confidence to pursue these GPU-type workflows with GCP, especially given the lack of insight into errors and availability.

FWIW, we have only ever been approved for 1 GPU under the GPUs (all regions) quota, even though we have lots of database and storage activity.

vulture903

Hi @acspock,

Thank you for sharing your experience. It resonates deeply with what I’ve been facing. I appreciate knowing I'm not the only one grappling with this issue.

After numerous attempts and countless hours of trial and error, I managed to deploy the N1-standard-8 instance with a T4 GPU in the us-west3-b zone (us-west3, Salt Lake City), which is one of the more expensive regions. Unfortunately, my relief was short-lived. The instance ran fine for a few hours, but once I stopped it and later tried to restart it, I encountered the same ZONE_RESOURCE_POOL_EXHAUSTED error. It was as if all progress was immediately undone.

This persistent issue is beyond frustrating and genuinely concerning, especially for projects with critical timelines and substantial GPU demands. As you mentioned, having no clear insight into resource availability and experiencing unpredictable success in deploying instances undermines the confidence necessary to scale or even continue GPU-dependent workflows on GCP.

I strongly believe Google Cloud needs to address this gap with a more robust approach, ensuring users have a minimum level of infrastructure security and reliability for their projects. Reliable access to resources is not just a "nice-to-have"—it's a baseline expectation.

Thank you for your input, and I hope we can push for more transparency and solutions regarding this critical issue.

acspock

Hey @vulture903 I have a small update.

For starters, I was able to spin up an L4 and keep it alive for a while as we tested. More importantly, we ended up scheduling a sales call with GCP/Google and met with a handful of individuals, including an account manager, sales, and technical solutions members. We were told that there are internal controls around GPUs and quota management.

After we discussed our project, our goals, and our technical approach with them, we were told there are internal steps that need to be completed on our account before a quota increase can be initiated.

Even after seeking clarification from the Google reps, it wasn't clear whether this would resolve our `ZONE_RESOURCE_POOL_EXHAUSTED` errors.

That was a little over a week ago and we're still waiting for their "internal process" for our account before upgrading our quota.

I'll keep this thread up to date with our progress and let you know if this process unblocks us from spinning up resources at will.

Cheers.

acspock

Hey @vulture903, I have some good news. After weeks of GCP's internal process, it appears we finally have our quota increases.

It looks like we received what we asked for: eight NVIDIA T4 and sixteen NVIDIA L4 GPUs.

We also received a 300-CPU limit for the C4 generation.

The interesting thing is that now, when I create VMs, our gpus_all_regions metric no longer shows usage. Instead, usage is counted directly against the quota for the specific GPU type.

E.g., NVIDIA T4 GPUs / usage: 2, limit: 8, and the same for the NVIDIA L4 GPUs metric.

The GPUs (all regions) metric doesn't appear to change at all now.
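For anyone who wants to check these numbers from a script rather than the console, here is a small sketch. It assumes the JSON printed by `gcloud compute project-info describe --format=json` (project-wide quotas such as GPUS_ALL_REGIONS) or `gcloud compute regions describe REGION --format=json` (per-region quotas such as NVIDIA_T4_GPUS), each of which contains a `quotas` list with `metric`, `limit`, and `usage` fields. The function name is made up for illustration.

```python
import json


def quota_headroom(describe_json, metric):
    """Return limit minus usage for one quota metric, or None if absent.

    describe_json is the JSON text from a gcloud "describe" call on the
    project or on a region.
    """
    doc = json.loads(describe_json)
    for quota in doc.get("quotas", []):
        if quota.get("metric") == metric:
            return quota["limit"] - quota["usage"]
    return None
```

With the usage of 2 and limit of 8 quoted above, this would report a headroom of 6 for the T4 metric. Keep in mind that quota is only a ceiling on what the project may request; it does not reserve capacity, so a zone can still return ZONE_RESOURCE_POOL_EXHAUSTED with plenty of headroom left.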

So my conclusion is that GCP likely flipped some internal switches to get us the quota and access that we needed. I would recommend speaking with the sales team and telling them your issues and what you're trying to achieve, even for a small project.
