Hi
I am looking to deploy a Mixtral 8x7B model on a GCP VM, write a script to call the model locally, and then expose an endpoint for the whole pipeline (not just the model).
The chatbot pipeline, which is built using LangChain, will now use a locally hosted model rather than calling an API from OpenAI or another provider.
My question is: how will I handle multiple concurrent requests? If 100 users send a request to the endpoint, how will the GPU handle them? I am new to DevOps, so I am just trying to figure this step out.
Hi @abe410,
A single VM has limited scaling capacity. Even if you cap the number of requests it should handle, you may still need more than one VM to meet demand. Ideally, you'd scale the number of instances running your model up or down based on request volume.
First, for handling API requests, Cloud Run is an efficient option because it auto-scales with the volume of incoming traffic. However, while Cloud Run excels at stateless workloads, it doesn't natively support GPUs, which your model's inference work requires.
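To show the shape of that front-end, here is a minimal sketch of a stateless service you could deploy to Cloud Run. It simply validates the request and forwards it to a GPU-backed inference service running elsewhere; `BACKEND_URL`, the `/chat` and `/generate` paths, and the timeout are placeholder assumptions, not fixed names.

```python
# Minimal, stateless FastAPI front-end intended for Cloud Run.
# It forwards each chat request to a GPU-backed inference service.
# BACKEND_URL and the /generate path are placeholders for whatever
# your backend actually exposes.
import os

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
# Internal IP of the GPU VM or load balancer (placeholder value).
BACKEND_URL = os.environ.get("BACKEND_URL", "http://10.0.0.2:8000")


class ChatRequest(BaseModel):
    prompt: str


@app.post("/chat")
async def chat(req: ChatRequest):
    # A long timeout is needed because LLM generation can take
    # tens of seconds per request.
    async with httpx.AsyncClient(timeout=120.0) as client:
        try:
            resp = await client.post(
                f"{BACKEND_URL}/generate", json={"prompt": req.prompt}
            )
            resp.raise_for_status()
            return resp.json()
        except httpx.HTTPError as exc:
            raise HTTPException(status_code=502, detail=str(exc))
```

Because this layer holds no state and no GPU, Cloud Run can spin up as many copies of it as your traffic needs.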
Second, for the GPU-intensive workload of the Mixtral 8x7B model, Compute Engine VMs equipped with NVIDIA GPUs, such as the A100, are ideal. These VMs are built for compute-intensive operations like ML inference. However, unlike Cloud Run, scaling and load balancing across these VMs require manual setup or the use of managed instance groups.
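On the GPU VM itself, the model can be wrapped in a small HTTP service. The sketch below is one possible way to do it with Hugging Face Transformers and FastAPI, assuming the weights fit on the attached GPU(s) (for example quantized, or on an A100 80GB); the model ID, port, and `max_new_tokens` are illustrative. The lock also makes your concurrency question visible: the GPU works on one generation at a time, so extra requests queue behind it.

```python
# Sketch of the inference service running on the GPU VM.
# Assumes the Mixtral weights fit on the attached GPU(s); requires
# the transformers and accelerate packages. Names are illustrative.
import asyncio

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.float16
)

app = FastAPI()
gpu_lock = asyncio.Lock()  # one generation at a time; extra requests wait here


class GenerateRequest(BaseModel):
    prompt: str


def _generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)


@app.post("/generate")
async def generate(req: GenerateRequest):
    # The lock serialises access to the GPU: with 100 concurrent
    # callers, 99 of them queue at this point. That is exactly why
    # you need batching or more GPU VMs at scale.
    async with gpu_lock:
        text = await asyncio.to_thread(_generate, req.prompt)
    return {"completion": text}
```

In practice you would likely replace this hand-rolled server with a dedicated serving stack such as vLLM or Text Generation Inference, which use continuous batching so a single GPU can serve many concurrent requests far more efficiently, but the queuing behaviour above is the core idea.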
An effective architecture might involve a scalable, serverless front-end on Cloud Run to process API requests, which are then routed to a Compute Engine backend for model inference, with a queuing system or load balancer distributing work across the GPU-equipped VMs.
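To make the "distribute work across GPU-equipped VMs" part concrete, here is a small round-robin dispatcher the front-end could use in place of a single `BACKEND_URL`. The backend IPs and the per-VM limit of 2 are made-up placeholders; in a real deployment a GCP internal load balancer in front of a managed instance group would do this job for you.

```python
# Hedged sketch of a round-robin dispatcher: spreads requests over
# several GPU VMs and caps in-flight work per VM. IPs and the limit
# of 2 concurrent requests per backend are placeholders.
import asyncio
import itertools

import httpx

BACKENDS = ["http://10.0.0.2:8000", "http://10.0.0.3:8000"]  # GPU VM internal IPs
_round_robin = itertools.cycle(range(len(BACKENDS)))
_limits = [asyncio.Semaphore(2) for _ in BACKENDS]  # max 2 in-flight per VM


async def dispatch(prompt: str) -> dict:
    idx = next(_round_robin)
    async with _limits[idx]:  # requests queue here if that VM is busy
        async with httpx.AsyncClient(timeout=120.0) as client:
            resp = await client.post(
                f"{BACKENDS[idx]}/generate", json={"prompt": prompt}
            )
            resp.raise_for_status()
            return resp.json()
```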
You may check these threads for more information:
I hope this helps. Thank you.