
Slow RAG engine response with Vertex AI TypeScript library (~30s) vs. AI Studio (~1s)

Hello everyone,

I'm experiencing a significant performance issue with my RAG engine implementation using the Vertex AI TypeScript library and I'm hoping to get some insights from the community.

Here's a summary of the situation:

  • Corpus Size: My RAG engine is working with a corpus of approximately 100,000 words.
  • Vertex AI TypeScript Library: When I send a query to the engine via my application using the Vertex AI TypeScript library, the response time is consistently slow, averaging around 30 seconds.
  • AI Studio: However, when I test the exact same query and setup in AI Studio, the response is very fast, typically taking only about 2 seconds.

This large performance discrepancy suggests the issue might be with my request parameters.
Current parameters:
{
  model: "gemini-2.5-flash",
  generationConfig: {
    maxOutputTokens: 512,
    stopSequences: parameters?.stop,
    temperature: 0.2,
    topP: 0.9,
    topK: 3,
  },
  ......
}


Has anyone else encountered a similar issue? I'm trying to understand what could be causing such a delay.

Any help or suggestions on what to investigate would be greatly appreciated.

Thanks in advance!

1 REPLY

Hi @Roksi,

Welcome to Google Cloud Community!

Here are some suggestions that may help resolve the issue:

  • For latency-related issues with your Vertex AI deployment, refer to the Vertex AI documentation on strategies to reduce latency, such as optimizing prompt and output length and using streaming. Based on your current generation parameters, lowering maxOutputTokens and temperature further is unlikely to improve latency much. Instead, consider streaming the response, or using system instructions that ask for concise answers to limit the output (see the streaming sketch after this list).
  • Another possible cause is how your Vertex AI TypeScript code handles query processing. Double-check that it isn't sending the entire corpus with every query, which bypasses the core performance benefit of RAG: a payload that large forces the model to process the full context on each request, resulting in much longer response times (see the retrieval-tool sketch after this list).
  • Implement caching where applicable. Context caching helps lower cost and speed up requests to Gemini by storing and reusing repeated content.
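
As a rough sketch of the streaming suggestion (assuming you are using the @google-cloud/vertexai Node SDK; the project and location values below are placeholders, so adjust them to your setup), streaming lets the first tokens reach the user while the rest of the answer is still being generated:

import { VertexAI } from '@google-cloud/vertexai';

// Placeholder project/location values -- replace with your own.
const vertexAI = new VertexAI({ project: 'your-project-id', location: 'us-central1' });

const model = vertexAI.getGenerativeModel({
  model: 'gemini-2.5-flash',
  // System instruction asking for concise answers, to keep the output short.
  systemInstruction: {
    role: 'system',
    parts: [{ text: 'Answer concisely, using only the retrieved context.' }],
  },
  generationConfig: {
    maxOutputTokens: 512,
    temperature: 0.2,
    topP: 0.9,
  },
});

async function askWithStreaming(question: string) {
  // generateContentStream yields chunks as they are produced, so partial
  // output can be shown immediately instead of waiting for the full response.
  const result = await model.generateContentStream({
    contents: [{ role: 'user', parts: [{ text: question }] }],
  });
  for await (const chunk of result.stream) {
    const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text;
    if (text) process.stdout.write(text);
  }
  // The aggregated response is still available once the stream completes.
  return await result.response;
}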
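
And a similar sketch of attaching your RAG corpus as a retrieval tool, so only the top-k retrieved chunks (not the whole 100,000-word corpus) are added to each request. The corpus resource name below is a placeholder, and the vertexRagStore tool fields assume a recent SDK version, so check the types shipped with your @google-cloud/vertexai release:

import { VertexAI } from '@google-cloud/vertexai';

const vertexAI = new VertexAI({ project: 'your-project-id', location: 'us-central1' });

// Placeholder resource name -- use the full resource name of your RAG corpus.
const RAG_CORPUS =
  'projects/your-project-id/locations/us-central1/ragCorpora/your-corpus-id';

const ragModel = vertexAI.getGenerativeModel({
  model: 'gemini-2.5-flash',
  // The retrieval tool lets Gemini pull only the most relevant chunks from the
  // corpus server-side, instead of receiving the entire corpus in the prompt.
  tools: [
    {
      retrieval: {
        vertexRagStore: {
          ragResources: [{ ragCorpus: RAG_CORPUS }],
          similarityTopK: 5,
        },
      },
    },
  ],
  generationConfig: { maxOutputTokens: 512, temperature: 0.2, topP: 0.9 },
});

async function askCorpus(question: string) {
  // Only the user question goes into the request; retrieval happens server-side.
  const result = await ragModel.generateContent({
    contents: [{ role: 'user', parts: [{ text: question }] }],
  });
  return result.response.candidates?.[0]?.content?.parts?.[0]?.text;
}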

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.