
Running AI/ML workloads with Cloud Run + GPU (In Preview)

Agreed, Cloud Run with GPU is a big plus that opens the door to running AI/ML workloads on Cloud Run. I am interested in knowing how these GPUs attach to a Cloud Run instance. We have a few open models that use an NVIDIA T4; we currently run them on a GPU-attached GCE VM and are contemplating moving to Cloud Run even though the feature is in preview. The important factor for us, however, is being able to scale GPUs. Is there a 1:1 mapping between a Cloud Run instance and a GPU? Also, Cloud Run does not automatically scale the number of instances based on GPU utilization, which is a big disadvantage in my opinion.

In summary, I would like to know how autoscaling works for Cloud Run with a GPU attached, and how I can optimize GPU costs (remember, GPUs are expensive).

1 ACCEPTED SOLUTION

Today we don't support GPU-utilization-based autoscaling; we plan to add that capability in the future. As @knet said, scaling is CPU based. However, if you want to ensure more instances are scaled out, this can be done based on the number of incoming requests: if you lower the maximum concurrency of your Cloud Run service, each instance accepts fewer simultaneous requests, so the same traffic causes more instances to be provisioned.
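As a rough back-of-the-envelope sketch (not the actual autoscaler algorithm, which also weighs CPU utilization and request queueing), the instance count scales with concurrent requests divided by the configured maximum concurrency. The function and its inputs below are hypothetical:

```python
import math

def estimated_instances(concurrent_requests: int, max_concurrency: int,
                        max_instances: int) -> int:
    """Rough estimate of Cloud Run scale-out: each instance handles at
    most `max_concurrency` requests at once, capped by max-instances.
    Simplified sketch only; the real autoscaler is more involved."""
    needed = math.ceil(concurrent_requests / max_concurrency)
    return min(needed, max_instances)

# Lowering max concurrency forces more (GPU-attached) instances:
print(estimated_instances(80, 40, 10))  # -> 2
print(estimated_instances(80, 4, 10))   # -> 10 (capped by max-instances)
```

Since each instance carries its own GPU, a lower concurrency setting effectively scales out GPUs with traffic, at the cost of more instances billed.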



Hi @dheerajpanyam

Yes, you are correct. You can configure one GPU per Cloud Run instance, and Cloud Run does not automatically scale the number of instances based on GPU utilization. However, Cloud Run autoscaling still applies: it is an on-demand service that automatically scales the number of instances to match workload demand.
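To make the 1:1 mapping concrete, here is a sketch of the deploy command as the preview docs describe it, assembled in Python so the pieces are commented. The service name, image, and region are hypothetical, and the flags should be verified against your current gcloud version:

```python
# Sketch of a Cloud Run GPU deployment (preview). All names below are
# placeholders; flag names are from the preview docs -- verify before use.
cmd = [
    "gcloud", "beta", "run", "deploy", "my-inference-service",
    "--image", "us-docker.pkg.dev/my-project/repo/model:latest",
    "--gpu", "1",                # one GPU per instance (1:1 mapping)
    "--gpu-type", "nvidia-l4",   # the preview offers NVIDIA L4
    "--cpu", "4",                # GPU services have CPU/memory minimums
    "--memory", "16Gi",          # (check current docs for exact values)
    "--no-cpu-throttling",       # GPU requires CPU always allocated
    "--max-instances", "3",      # cap scale-out to bound GPU cost
    "--concurrency", "4",        # lower concurrency -> earlier scale-out
    "--region", "us-central1",
]
print(" ".join(cmd))
```

Note the `--max-instances` cap: because every instance carries a GPU, it doubles as a hard ceiling on GPU spend.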

Yes, GPUs are expensive. But instances of a Cloud Run service configured to use a GPU can scale down to zero when not in use, which optimizes cost efficiency. You may check the Cloud Run pricing page for reference.
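The scale-to-zero effect on cost can be sketched with a toy calculation. The per-second rate below is hypothetical; use the actual GPU rate for your region from the Cloud Run pricing page:

```python
# Hypothetical per-second GPU rate -- replace with the actual rate
# for your region from the Cloud Run pricing page.
GPU_RATE_PER_SEC = 0.000233

def gpu_cost(billed_seconds: float) -> float:
    """GPU portion of the bill for the given billed instance-seconds."""
    return billed_seconds * GPU_RATE_PER_SEC

# An always-on GPU VM bills 24 h/day; a scale-to-zero Cloud Run
# service bills only while instances are up (say 3 h/day of traffic).
always_on = gpu_cost(24 * 3600)
scale_to_zero = gpu_cost(3 * 3600)
print(f"always-on: ${always_on:.2f}/day, scale-to-zero: ${scale_to_zero:.2f}/day")
```

With bursty traffic the savings compound; with sustained traffic the two converge, so the comparison is worth running against your own request patterns.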

Also, you can check this informative blog regarding Cloud Run GPU and LLMs for additional reference.

I hope the above information is helpful.

Curious to know whether GPU metrics can be monitored when GPUs are used in conjunction with a Cloud Run service. Are they exposed as built-in metrics, like CPU and memory? @ronnelg 

Yes, GPU usage and utilization are included in the Cloud Run service metrics.
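Since GPU-based autoscaling isn't supported yet, one option is to poll the GPU utilization metric from Cloud Monitoring and react manually (e.g. raise `--max-instances` or lower concurrency when the GPU is saturated). A minimal sketch of the decision logic, with hypothetical sample data standing in for the monitoring query:

```python
def sustained_above(samples, threshold=0.8, window=3):
    """Return True if the last `window` utilization samples (0.0-1.0)
    all exceed `threshold` -- a simple signal that the GPU, not the
    CPU, is the bottleneck and the scaling settings may need tuning."""
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)

# Hypothetical utilization samples pulled from Cloud Monitoring:
print(sustained_above([0.4, 0.6, 0.9, 0.95, 0.92]))  # -> True
print(sustained_above([0.4, 0.6, 0.9, 0.5, 0.92]))   # -> False
```

In practice you would wire this to a Cloud Monitoring alerting policy on the GPU utilization metric rather than polling yourself; the threshold and window here are illustrative.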

Thanks @ronnelg. Perhaps I can use these metrics to drive GPU scaling.

Hello @dheerajpanyam, from what I've seen, at least for basic chat apps, the number of GPU instances scales well with user traffic. As @ronnelg said, the scaling is based on CPU utilization; most likely, your application is using some CPU as well.

Thanks @ronnelg and @knet. What we are seeing with a GPU attached to a VM is that the GPU has become the bottleneck, not the VM's CPU or memory, and the only way to resolve the issue is to attach another GPU.


Thanks @sagarrandive and others; closing this post.