Hi
I am looking to deploy a Mixtral 8x7B model on a GCP VM, write a script that calls the model locally, and then expose an endpoint for the whole pipeline (not just the model).
The chatbot pipeline, which is built with LangChain, will now use the locally hosted model rather than calling an API from OpenAI or another provider.
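
For context, here is a minimal sketch of what I have in mind. I am assuming vLLM as the serving layer (its OpenAI-compatible server lets LangChain talk to the local model as if it were OpenAI), but I have not settled on that; the model name, port, and endpoint path are placeholders:

```python
# Sketch of the planned pipeline. Assumes a vLLM OpenAI-compatible
# server is already running on localhost:8000 -- vLLM is my guess at a
# serving layer, not something I've committed to.
from fastapi import FastAPI
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

app = FastAPI()

# LangChain calls the local model through the OpenAI-compatible API,
# so no real OpenAI key is involved.
llm = ChatOpenAI(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    base_url="http://localhost:8000/v1",  # local vLLM server
    api_key="not-needed",
)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}"),
])
chain = prompt | llm

class Query(BaseModel):
    question: str

@app.post("/chat")
async def chat(query: Query):
    # The whole pipeline (prompt + local model call) sits behind this
    # one endpoint, which is what I want to expose to users.
    result = await chain.ainvoke({"question": query.question})
    return {"answer": result.content}
```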
My question is: how do I handle multiple concurrent requests? If 100 users hit the endpoint at once, how will the GPU handle those requests? I am new to DevOps, so I am just trying to figure this step out.
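
For what it is worth, this is the kind of load test I was planning to run against the endpoint to see what actually happens under concurrency; the URL and payload are hypothetical and just match the sketch above:

```python
# Fire N concurrent requests at the pipeline endpoint and report
# latencies. URL and payload are placeholders matching the sketch above.
import asyncio
import time

import httpx

async def one_request(client: httpx.AsyncClient, i: int) -> float:
    start = time.perf_counter()
    await client.post(
        "http://localhost:8080/chat",
        json={"question": f"Test question {i}"},
        timeout=300,
    )
    return time.perf_counter() - start

async def main(n: int = 100) -> None:
    async with httpx.AsyncClient() as client:
        # Launch all requests at once to simulate n simultaneous users.
        latencies = await asyncio.gather(
            *(one_request(client, i) for i in range(n))
        )
    print(f"{n} concurrent requests: avg {sum(latencies) / n:.1f}s, "
          f"max {max(latencies):.1f}s")

asyncio.run(main())
```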