Hi,
I've created a page with a webhook that runs as a Cloud Function. Through the Cloud Function, I send the user prompt to a fine-tuned model in Vertex AI and return the response to Dialogflow CX. But the latency of getting the response back to Dialogflow CX is high. What are the ways to reduce the time it takes to get the answer from the LLM?
Thanks in advance.
Hi @Brindha, what I would suggest is to use the built-in generators available in Dialogflow CX, which do basically the same task but with better performance: https://cloud.google.com/dialogflow/cx/docs/concept/generative/generators
Best,
Xavi
Thanks for the response @xavidop. But I want the response to come from our custom fine-tuned model, trained on our own dataset, and that model isn't showing up in the generators either. Is there any approach, such as changing the NLU type, that would get the response back quicker?
Hi @Brindha, to interact with your custom models you will need to call a webhook that calls your fine-tuned model! Where did you deploy that model? When you test it in isolation, is it still slow?
Hi @xavidop. I created the model under "Tune and distill" and deployed it to an endpoint, which I call from my Cloud Function. The response then comes back to Dialogflow CX through the webhook.
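For reference, a minimal sketch of that setup, assuming a Python Cloud Function and the Vertex AI SDK; the project, region, endpoint ID, and instance/prediction format are placeholders, not details from this thread:

```python
# Hypothetical sketch of a Dialogflow CX webhook (Python Cloud Function)
# that forwards the user query to a tuned model deployed on a Vertex AI endpoint.
# Project, region, endpoint ID, and the instance schema are placeholders.
import functions_framework
from google.cloud import aiplatform

# Created at module scope so it is only initialised on cold starts.
aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

@functions_framework.http
def webhook(request):
    body = request.get_json(silent=True) or {}
    user_text = body.get("text", "")  # end-user utterance from the WebhookRequest

    # The instance format depends on how the tuned model was deployed.
    prediction = endpoint.predict(instances=[{"prompt": user_text}])
    answer = str(prediction.predictions[0])

    # Dialogflow CX WebhookResponse format
    return {
        "fulfillment_response": {
            "messages": [{"text": {"text": [answer]}}]
        }
    }
```

Keeping `aiplatform.init` and the `Endpoint` object at module scope means they are only created on cold starts, which shaves a little off per-request latency.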
What kind of delays are you seeing?
The delay is in getting the prediction from the tuned model back to Dialogflow CX as the bot response. Is there any way to speed up the process?
I am not sure if there is a way of changing the machine/compute that is used for inference.
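For what it's worth, if the model is one you deploy yourself (an uploaded model in the Model Registry, rather than a Google-managed tuned foundation model), the machine type and replica counts are chosen at deploy time. A hedged sketch with the Python SDK, with placeholder model/endpoint IDs and hardware:

```python
# Hypothetical sketch: redeploying a self-managed model to a Vertex AI endpoint
# on a larger machine type. IDs and machine/accelerator choices are placeholders;
# this does not apply to Google-managed tuned foundation models.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")
endpoint = aiplatform.Endpoint("projects/my-project/locations/us-central1/endpoints/9876543210")

model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-8",        # larger machine for faster inference
    accelerator_type="NVIDIA_TESLA_T4",  # optional GPU
    accelerator_count=1,
    min_replica_count=1,                 # keep at least one replica warm
    max_replica_count=2,
    traffic_percentage=100,
)
```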
There is a problem with using the Generators: they are not hugely performant, and the way Dialogflow CX responses are parsed, i.e. it waits for all responses to return before sending anything back, means there is a lag. We've noticed this in our IVR system. There is no way to return one message before going off to a page with a generator, or even more simply a response on the same page, e.g. "let me look into that for you...".
This is a known problem and not one that looks set to be fixed soon (AKA not on Google's roadmap, the last time I asked, Oct 2023).
You can stream partial responses over voice but that didn't work for us.
Thanks for the input @adrianthompson! In that case, what I would suggest is to use a webhook and an LLM with better performance, like the ones available on Vertex AI or a Llama model deployed on your own infrastructure.
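If you go that route, here is a minimal sketch of calling a Google-hosted Vertex AI model from the webhook instead of a self-deployed endpoint, assuming the Vertex AI Python SDK; the project, region, model name, and generation settings are placeholders:

```python
# Hypothetical sketch: calling a hosted Vertex AI foundation model from the webhook
# instead of a self-deployed endpoint. Model name and region are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")
model = GenerativeModel("gemini-1.0-pro")

def answer(user_text: str) -> str:
    # Capping output tokens keeps generation time (and therefore latency) bounded.
    response = model.generate_content(
        user_text,
        generation_config={"max_output_tokens": 256, "temperature": 0.2},
    )
    return response.text
```

A hosted model means there is no endpoint of your own to size and keep warm, though you do give up the custom fine-tuning unless you tune the hosted model itself.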
To reduce latency:
1. Optimize the model for speed.
2. Deploy on a low-latency platform like Vertex AI.
3. Batch processing for multiple requests.
4. Cache frequently requested responses (see the sketch after this list).
5. Consider asynchronous processing.
6. Use a CDN for global users.
7. Monitor and optimize performance continuously.
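On point 4, a minimal sketch of response caching inside the webhook, assuming exact-match prompts repeat often enough to be worth caching; the TTL and key scheme are arbitrary choices:

```python
# Hypothetical sketch: in-memory cache of model responses inside the webhook.
# This only works per function instance; a shared cache (e.g. Memorystore)
# would be needed across instances. TTL and key scheme are arbitrary.
import time

_CACHE: dict[str, tuple[float, str]] = {}
_TTL_SECONDS = 300

def cached_answer(user_text: str, call_model) -> str:
    key = user_text.strip().lower()
    now = time.time()

    hit = _CACHE.get(key)
    if hit and now - hit[0] < _TTL_SECONDS:
        return hit[1]  # serve from cache, skipping the model round trip

    answer = call_model(user_text)  # fall back to the LLM on a cache miss
    _CACHE[key] = (now, answer)
    return answer
```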