Hi,
Since April 2nd we have seen a strong quality decrease on gemini-2.0-flash. We haven't changed our code base or our prompting. We use the Chat Completions API from the OpenAI SDK. Because we were facing so many hallucinations, we had to stop our business. The model has even started inventing functions that do not exist.
We tried two different endpoints: europe-west1 (Belgium) and europe-west9 (Paris).
The problem clearly sits with the Gemini model, as we did a direct comparison with gpt-4o using exactly the same configuration, prompt, etc.:
OpenAI's prompt adherence is more or less perfect, whereas Gemini's is simply crap.
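For illustration, this is roughly how the comparison looks on our side. The Vertex AI base_url pattern, the "google/gemini-2.0-flash" model id, and the access-token auth below are placeholders, not a definitive setup:

```python
# Minimal A/B sketch: the same system prompt, user message, and sampling
# settings sent to both models through the OpenAI SDK.
# NOTE: the Vertex AI base_url pattern and the "google/gemini-2.0-flash"
# model id are assumptions -- verify them against your own project.
from openai import OpenAI

SYSTEM_PROMPT = "..."   # unchanged system prompt (<7K tokens)
USER_MESSAGE = "..."    # identical test input for both models

def ask(client: OpenAI, model: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_MESSAGE},
        ],
        temperature=0.2,   # same sampling settings for both models
        top_p=0.95,
    )
    return resp.choices[0].message.content

openai_client = OpenAI(api_key="OPENAI_API_KEY")
gemini_client = OpenAI(
    api_key="GCLOUD_ACCESS_TOKEN",  # e.g. from `gcloud auth print-access-token`
    base_url=(
        "https://europe-west1-aiplatform.googleapis.com/v1/"
        "projects/PROJECT_ID/locations/europe-west1/endpoints/openapi"
    ),
)

print("gpt-4o:          ", ask(openai_client, "gpt-4o"))
print("gemini-2.0-flash:", ask(gemini_client, "google/gemini-2.0-flash"))
```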
Does anyone else have similar problems?
Andreas
Hi @co-brainer,
Welcome to Google Cloud Community!
If you're seeing hallucinations and incorrect function calls, it might be worth checking if there have been any updates or modifications to Gemini's model behavior that could be affecting your use case.
Here are some potential solutions that might address your issue:
Experiment with Different Prompting Techniques: While you say your prompts haven't changed, it's worth trying variations to see whether you can mitigate the problem.
To learn more about effective prompting techniques, you may visit the Five Best Practices for Prompt Engineering and Prompt Design Strategies.
Adjust Model Parameters: Experiment with temperature, top_p, and frequency/presence penalties (see the sketch after this list). Lowering the temperature can sometimes reduce hallucinations, but it may also make the responses more rigid.
Consider Other Gemini Models: If possible, test other Gemini variants to see if they exhibit the same problems. However, performance differences may vary across use cases.
Document Everything: Keep a detailed log of the prompts you're using, the responses you're getting, and any changes you make to your code or prompting strategies. This will be invaluable for troubleshooting.
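Here's a rough sketch that combines the parameter and logging suggestions above. The helper name, the JSONL path, and whether the OpenAI-compatible Gemini endpoint actually forwards the penalty fields are assumptions to verify against your setup:

```python
# Sketch combining "Adjust Model Parameters" and "Document Everything":
# each call is made with explicit sampling settings and appended to a
# JSONL log so regressions can be compared over time.
import json
import time

from openai import OpenAI

client = OpenAI(api_key="...", base_url="...")  # same client as in your app

def logged_completion(model: str, messages: list[dict], **params) -> str:
    resp = client.chat.completions.create(model=model, messages=messages, **params)
    answer = resp.choices[0].message.content
    with open("gemini_calls.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "model": model,
            "params": params,
            "messages": messages,
            "answer": answer,
        }) + "\n")
    return answer

answer = logged_completion(
    "google/gemini-2.0-flash",
    [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}],
    temperature=0.1,        # lower temperature can reduce hallucinations
    top_p=0.9,
    presence_penalty=0.0,   # verify the compat layer forwards these to Gemini
    frequency_penalty=0.0,
)
```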
If the issue persists, I suggest contacting Google Cloud Support as they can provide more insights to see if the behavior you've encountered is a known issue or specific to your project.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Hi ibaui,
thanks for your reply, but this does not really help.
Please read again carefully: we have not changed our code base, our prompts, or our application implementation. Yet all of a sudden the model changed its behavior and started to underperform.
What definitely proves to be true is that the size of the context window has a huge impact on performance. The system prompts we currently need for our application are definitely less than 7K tokens, which already seems to be too big to get reliable results. As soon as we reduce them to 2-3K tokens, prompt adherence increases dramatically. So why is Google marketing a 1M-token context window? I do not get this.
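For reference, this is roughly how the prompt size can be checked up front (a sketch assuming the Vertex AI Python SDK; the model id and region are placeholders):

```python
# Sketch: measure the system prompt size with the Vertex AI SDK's
# count_tokens before sending it. Model id and region are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="PROJECT_ID", location="europe-west1")
model = GenerativeModel("gemini-2.0-flash")

SYSTEM_PROMPT = "..."  # the full system prompt used by the application
print(model.count_tokens(SYSTEM_PROMPT).total_tokens)  # stays well under 7K
```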
My team has been having the exact same issues from us-central1. Performance has gone way down since April 2nd with no clear indication why. We even went back several commits to where we were getting around 85% correctness on our tests, and it is now down to around 70%. So it definitely isn't a code issue.