Agreed, Cloud Run with GPU is a big plus that opens the door to running AI/ML workloads on Cloud Run. I am interested in knowing how these GPUs attach to a Cloud Run instance. We have a few open models that use NVIDIA T4 GPUs, currently served from a GPU-attached GCE VM, and we are contemplating moving to Cloud Run even though the feature is in preview. The important factor for us, however, is the ability to scale GPUs. Is it a 1:1 mapping between a Cloud Run instance and a GPU? Also, Cloud Run does not automatically scale the number of instances based on GPU utilization, which is a big disadvantage in my opinion.
In summary, I would like to know how autoscaling works on Cloud Run with a GPU attached, and how I can optimize GPU costs (remember, GPUs are expensive).
Today we don't support GPU-usage-based autoscaling; we plan to add that capability in the future. As @knet said, scaling is CPU based. However, if you want to ensure more instances are scaled out, this can be driven by the number of incoming requests: if you lower the concurrency setting of your Cloud Run service, each instance accepts fewer simultaneous requests, so requests queue up sooner and more instances are provisioned.
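To make the relationship between concurrency and instance count concrete, here is a back-of-the-envelope sketch (my own simplified model of request-based scaling, not an official Cloud Run formula; the function name and numbers are illustrative):

```python
import math

def estimated_instances(requests_per_sec: float,
                        avg_latency_sec: float,
                        concurrency: int) -> int:
    """Rough estimate of the Cloud Run instance count for steady traffic.

    Little's law gives the average number of in-flight requests
    (requests_per_sec * avg_latency_sec); the autoscaler roughly
    provisions enough instances so that each handles at most
    `concurrency` of them at once.
    """
    in_flight = requests_per_sec * avg_latency_sec
    return max(1, math.ceil(in_flight / concurrency))

# 50 req/s at 2 s average latency = ~100 in-flight requests.
print(estimated_instances(50, 2.0, 80))  # high concurrency -> few instances
print(estimated_instances(50, 2.0, 4))   # low concurrency -> many instances
```

With concurrency 80 the model yields 2 instances; dropping concurrency to 4 yields 25 instances for the same traffic, which is why tuning concurrency is an indirect lever over how many GPUs are provisioned.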
Hi @dheerajpanyam,
Yes, you are correct. You can configure one GPU per Cloud Run instance, and Cloud Run does not automatically scale the number of instances based on GPU utilization. However, Cloud Run autoscaling still applies: it is an on-demand service that automatically scales the number of instances to match workload demand.
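For reference, attaching one GPU per instance at deploy time looks roughly like this in the preview (the flag names, resource minimums, and the supported GPU type may change while the feature is in preview, so treat this as a sketch and check the current docs; SERVICE, IMAGE, and the region are placeholders):

```shell
# Deploy a Cloud Run service with one GPU per instance (preview).
# GPU services currently require CPU always allocated and a larger
# CPU/memory floor; verify the exact minimums in the current docs.
gcloud beta run deploy SERVICE \
  --image IMAGE \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --cpu 4 --memory 16Gi \
  --concurrency 4 \
  --max-instances 5
```

Note that the preview exposes NVIDIA L4 rather than T4, so a move from a T4-backed GCE VM would also mean a GPU generation change.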
Yes, GPUs are expensive. However, instances of a Cloud Run service configured to use a GPU can scale down to zero when not in use, which optimizes cost efficiency. You may check the Cloud Run pricing page for reference.
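To make the scale-to-zero point concrete, here is a toy cost comparison (the hourly rate and utilization figures below are made-up placeholders, not real Cloud Run or GCE prices; use the pricing pages for actual numbers):

```python
def monthly_gpu_cost(rate_per_hour: float, billed_hours: float) -> float:
    """Cost of the GPU time actually billed over a month."""
    return rate_per_hour * billed_hours

HOURS_PER_MONTH = 730
GPU_RATE = 0.70  # placeholder $/GPU-hour; NOT a real price

# An always-on GCE VM bills the GPU 24/7, even when idle.
vm_cost = monthly_gpu_cost(GPU_RATE, HOURS_PER_MONTH)

# A scale-to-zero Cloud Run service bills only while instances are
# serving, e.g. ~6 busy hours per day over a 30-day month.
run_cost = monthly_gpu_cost(GPU_RATE, 6 * 30)

print(f"always-on VM:  ${vm_cost:.2f}/month")
print(f"scale-to-zero: ${run_cost:.2f}/month")
```

The absolute numbers are meaningless, but the ratio is the point: if the service is busy only a fraction of the day, billing stops when instances scale to zero, whereas the VM bills the GPU around the clock.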
Also, you can check this informative blog regarding Cloud Run GPU and LLMs for additional reference.
I hope the above information is helpful.
Curious to know whether GPU metrics can be monitored when GPUs are used in conjunction with a Cloud Run service. Are they exposed as metrics like CPU and memory are? @ronnelg
Yes, GPU usage and utilization are included in the Cloud Run service metrics.
Thanks @ronnelg, I can probably use these metrics to scale GPUs.
Hello @dheerajpanyam, from what I've seen, at least for basic chat apps, the number of GPU instances scales well with user traffic. As @ronnelg said, the scaling is based on CPU utilization; most likely your application is using some CPU as well.
Thanks @sagarrandive and others, closing this post.