Hi,
I'm making a cURL request to Vertex AI using the chat-bison-32k model. The context message I'm sending is quite large (around 44,000 characters), and the response time for a single request is consistently between 30 and 35 seconds. How can I optimize this to achieve a response time of 4 to 5 seconds?
The use case is to generate a SQL query for a question asked to the model.
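For reference, this is roughly the shape of the request I'm sending, using the standard Vertex AI `predict` endpoint for the PaLM chat models. The project ID, region, context string, and question are placeholders, not my real values:

```bash
REGION="us-central1"          # placeholder region
PROJECT_ID="my-project"       # placeholder project ID

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/chat-bison-32k:predict" \
  -d '{
    "instances": [{
      "context": "<~44,000-character schema/context string goes here>",
      "messages": [{"author": "user", "content": "Which customers placed orders last month?"}]
    }],
    "parameters": {
      "temperature": 0.2,
      "maxOutputTokens": 1024
    }
  }'
```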
I am experiencing a latency of around 2 minutes. I haven't been able to capture the full output, but the input is ~36,000 tokens and the output is approximately 5,000 tokens.