Hi
I am looking to deploy a Mixtral 8x7B model on a GCP VM, write a script that calls the model locally, and then expose an endpoint for the whole pipeline (not just the model).
The chatbot pipeline, which is built with LangChain, will now use the locally hosted model rather than calling an API from OpenAI or another provider.
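
For context, here is a minimal sketch of what I have in mind. I am assuming vLLM as the serving layer (its OpenAI-compatible server lets LangChain talk to the local model as if it were OpenAI), but I have not settled on that; the model name, port, and endpoint path are placeholders:

```python
# Sketch of the planned pipeline. Assumes a vLLM OpenAI-compatible
# server is already running on localhost:8000 -- vLLM is my guess at a
# serving layer, not something I've committed to.
from fastapi import FastAPI
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

app = FastAPI()

# LangChain calls the local model through the OpenAI-compatible API,
# so no real OpenAI key is involved.
llm = ChatOpenAI(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    base_url="http://localhost:8000/v1",  # local vLLM server
    api_key="not-needed",
)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}"),
])
chain = prompt | llm

class Query(BaseModel):
    question: str

@app.post("/chat")
async def chat(query: Query):
    # The whole pipeline (prompt + local model call) sits behind this
    # one endpoint, which is what I want to expose to users.
    result = await chain.ainvoke({"question": query.question})
    return {"answer": result.content}
```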
My question is: how do I handle multiple concurrent requests? If 100 users hit the endpoint at once, how will the GPU handle those requests? I am new to DevOps, so I am just trying to figure this step out.
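
For what it is worth, this is the kind of load test I was planning to run against the endpoint to see what actually happens under concurrency; the URL and payload are hypothetical and just match the sketch above:

```python
# Fire N concurrent requests at the pipeline endpoint and report
# latencies. URL and payload are placeholders matching the sketch above.
import asyncio
import time

import httpx

async def one_request(client: httpx.AsyncClient, i: int) -> float:
    start = time.perf_counter()
    await client.post(
        "http://localhost:8080/chat",
        json={"question": f"Test question {i}"},
        timeout=300,
    )
    return time.perf_counter() - start

async def main(n: int = 100) -> None:
    async with httpx.AsyncClient() as client:
        # Launch all requests at once to simulate n simultaneous users.
        latencies = await asyncio.gather(
            *(one_request(client, i) for i in range(n))
        )
    print(f"{n} concurrent requests: avg {sum(latencies) / n:.1f}s, "
          f"max {max(latencies):.1f}s")

asyncio.run(main())
```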