
Dialogflow CX: delay in returning the response from a fine-tuned model

Hi,
I've created a page with a webhook implemented as a Cloud Function. Through the Cloud Function, I'm sending the user prompt to a fine-tuned model in Vertex AI and returning the response to Dialogflow CX. But the latency of returning the response to Dialogflow CX is high. What are the ways to reduce the time taken to get the response from the LLM?
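
For context, here is roughly what the Cloud Function webhook does (a minimal sketch; the project, region, endpoint ID, and instance format are placeholders for my actual setup):

```python
import functions_framework
from google.cloud import aiplatform

# Deployed tuned-model endpoint (placeholder resource name).
ENDPOINT = aiplatform.Endpoint(
    "projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID"
)

@functions_framework.http
def webhook(request):
    body = request.get_json(silent=True) or {}
    # Dialogflow CX puts the end-user utterance in the "text" field for text input.
    user_prompt = body.get("text", "")

    # Call the tuned model; the instance schema depends on how the model was tuned.
    prediction = ENDPOINT.predict(instances=[{"prompt": user_prompt}])
    answer = str(prediction.predictions[0])

    # Return a Dialogflow CX fulfillment response with the model output.
    return {
        "fulfillmentResponse": {
            "messages": [{"text": {"text": [answer]}}]
        }
    }
```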

Thanks in advance.


Hi @Brindha, what I would suggest is to use the built-in generators that are available in Dialogflow CX, which basically do the same task but with better performance: https://cloud.google.com/dialogflow/cx/docs/concept/generative/generators

Best,

Xavi

Thanks for the response @xavidop. But I want the response from the custom fine-tuned model trained on our custom dataset. Our fine-tuned custom model doesn't show up in the generators either. Is there any approach, such as changing the NLU type, to get the response quicker?

Hi @Brindha, to interact with your custom models you will need to call a webhook that calls your fine-tuned model! Where did you deploy that model? When you tested it in isolation, was it still slow?
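
If it helps, a quick way to time the endpoint in isolation (outside Dialogflow and the Cloud Function) is something like this; the resource name is a placeholder:

```python
import time
from google.cloud import aiplatform

endpoint = aiplatform.Endpoint(
    "projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID"
)

# Time a single prediction directly against the endpoint.
start = time.perf_counter()
prediction = endpoint.predict(instances=[{"prompt": "hello"}])
print(f"endpoint latency: {time.perf_counter() - start:.2f}s")
print(prediction.predictions[0])
```

If the endpoint alone is already slow, the problem is on the model/serving side rather than in Dialogflow CX or the webhook.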

Hi @xavidop. I tuned the model using the tune-and-distill flow, deployed it to an endpoint, and used that endpoint in my Cloud Function. Through it, the response comes back to Dialogflow CX.

 

What are the delays you are seeing?

I'm seeing the delay when getting the prediction from the tuned model back to Dialogflow CX as a bot response. Is there any way to speed up the process?

I am not sure if there is a way of changing the machine/compute type that is used for inference.

There is a problem with using the generators: they are not hugely performant, and the way Dialogflow responses are handled, i.e. it waits for all responses to return before sending anything back, means that there is a lag. We've noticed this in our IVR system. There is no way to return one message before going off to a page with a generator, or even more simply a response on the same page, e.g. "let me look into that for you..."

This is a known problem and not one that looks set to be fixed soon (i.e. not on Google's roadmap, as of the last time I asked in Oct 2023).
You can stream partial responses over voice but that didn't work for us.

Thanks for the input @adrianthompson! In that case, what I would suggest is to use a webhook and an LLM that has better performance, like the ones available on Vertex AI or a Llama model deployed on your own machine.
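
As a rough sketch (the model name, project, and region are only examples, not a specific recommendation), the webhook could call a hosted Vertex AI model like this:

```python
import functions_framework
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="PROJECT_ID", location="us-central1")
model = GenerativeModel("gemini-1.0-pro")  # example model name

@functions_framework.http
def webhook(request):
    body = request.get_json(silent=True) or {}
    user_prompt = body.get("text", "")

    # Single-turn generation; keeping max_output_tokens low also helps latency.
    result = model.generate_content(
        user_prompt,
        generation_config={"max_output_tokens": 256},
    )

    return {
        "fulfillmentResponse": {
            "messages": [{"text": {"text": [result.text]}}]
        }
    }
```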

To reduce latency:

1. Optimize the model for speed.

2. Deploy on a low-latency platform like Vertex AI.

3. Batch processing for multiple requests.

4. Cache frequently requested responses (see the sketch after this list).

5. Consider asynchronous processing.

6. Use a CDN for global users.

7. Monitor and optimize performance continuously.
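
For point 4, a minimal caching sketch (names are illustrative). Note that in Cloud Functions a module-level cache only lives as long as a warm instance, so a shared store such as Memorystore would be needed to share entries across instances:

```python
import hashlib

# In-process cache keyed on the normalized prompt.
_cache: dict[str, str] = {}

def cached_answer(user_prompt: str, call_model) -> str:
    """Return the cached answer for repeated prompts, otherwise call the model."""
    key = hashlib.sha256(user_prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(user_prompt)
    return _cache[key]
```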