Gemini 1.5 Flash: API Results Worse than AI Studio?

Hey y'all. Pretty new to Gemini, so pardon ahead of time if I'm just being an idiot. I do that sometimes.

Problem: gemini-1.5-flash responses from API calls are consistently much lower quality than responses from AI Studio, and this is blocking me from migrating from GPT to Gemini.

Summary: I've been absolutely blown away by the quality and consistency of the responses I've gotten from gemini-1.5-flash in AI Studio so far. I'm at the point where I would love to move my project from OAI over to Gemini as a result! Unfortunately, when I run the exact same prompt (copy-paste) through my API, the results are pretty terrible. Specifically, the response the API sends me does not follow the directions correctly: it ignores constraints and returns a substantially less comprehensive analysis that seems to degrade in quality/consistency as the result is processed.

What I've tried so far:

  • Verified both environments are pointing to the same model (gemini-1.5-flash-latest)
  • Verified that the temperature is exactly the same as I've set in AI Studio (0.25)
  • Duplicated other parameters (see example below) from AI Studio to my API

Example Prompt & Responses (Link)

(Possibly) Relevant Context:

  • I only have one Google Cloud account, and one project set up. It is currently set up as Paid
  • The request is ~8k tokens, and is sent as stringified JSON (text) that includes basic instructions, source data, and a JSON structure for Gemini to use in its response (~7k tokens).
  • The request is self-contained, as it references no objects outside the text in the JSON I'm sending.
  • My API call is a simple HTTP POST request, as follows:

{
  "contents": {
    "role": "user",
    "parts": {
      "text": <my-prompt>
    }
  },
  "generation_config": {
    "temperature": 0.25,
    "top_p": 0.95,
    "top_k": 64,
    "max_output_tokens": 8192,
    "response_mime_type": "application/json"
  }
}
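
In case it helps to reproduce: an equivalent call in Python with the requests library would look roughly like this. It's a sketch only; the API key handling, the timeout, and the <my-prompt> placeholder are illustrative, and contents/parts are wrapped as arrays, which the REST endpoint also accepts.

import requests

# Sketch of the POST above in Python (illustrative: key handling, timeout,
# and the <my-prompt> placeholder are not real values).
API_KEY = "YOUR_API_KEY"
URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-1.5-flash-latest:generateContent"
)

payload = {
    "contents": [{"role": "user", "parts": [{"text": "<my-prompt>"}]}],
    "generation_config": {
        "temperature": 0.25,
        "top_p": 0.95,
        "top_k": 64,
        "max_output_tokens": 8192,
        "response_mime_type": "application/json",
    },
}

resp = requests.post(URL, params={"key": API_KEY}, json=payload, timeout=120)
resp.raise_for_status()
# Generated text is at candidates[0].content.parts[0].text in the response.
print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])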

 


I agree with you.
Especially with image-based requests, the OCR is clearly not as accurate as it could be.

Hi,

Did you resolve the problem? If yes, could you please share the solution with us?

@Ilia sadly, I've not been able to achieve a resolution, and in fact have only run into more issues since 1.5 went from preview to production. It's been quite frustrating, and it seems like so much capability is being nerfed in the API. 👎🏼

I have found my own little sweet spot for application building. I use the following (note that my application building is DevOps related, so I don't need the creativity of a storytelling chatbot, and my attributes reflect that):

temperature: 0.1

top_k: 10

top_p: 0.1

I also provide very clear instructions using the "system_instructions" option. System instruction prompts "should" be prioritized over user prompts. Additionally, keep track of the context window size, as things will get very unpredictable after passing that. I have multiple agents handling different tasks in the conversation, and I keep the chat_history for the main agent free of all the data that gets punted around the worker agents, which makes it all work very well and provides very accurate, consistent results.

Thanks for the advice, @jamiabailey !

I've extensively played around with the params and system instructions, but I've not worked with agents or caching before. I've wanted to try this, but thought it was only available in chat, whereas my use case is a single REST call. I will definitely double-check this! In the meantime, were there any tutorials/docs/guides you found helpful for doing this within Vertex AI?

Not sure if needed, but just in case, here's a bit more context to my use case:

I'm sending a single prompt via a REST call to 1.5 Flash totaling 2-3k tokens. I include multi-step (3-5) instructions, along with the text for Gemini to analyze and a JSON structure for Gemini to respond with. (Once I receive the response, I process the JSON data on my server, outside of Google.) As a simplified example: here is the text of document A, and here is the text of document B. 1. First, extract some information from document A (i.e. keywords); 2. Run some analysis that compares docs A and B; 3. Send me a structured response with the results of the previous steps. (These instructions are consistent on every call; only the document text changes.)
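
To make that concrete, the prompt I send looks roughly like the template below. It is heavily simplified: the instructions, document text, and response schema shown here are placeholders, not my real ones.

# Heavily simplified sketch of the single prompt I send. The instructions,
# documents, and JSON schema below are placeholders, not my real ones.
PROMPT_TEMPLATE = """
Instructions:
1. Extract up to 20 keywords from Document A only.
2. Compare Document A against Document B using the keywords from step 1.
3. Respond ONLY with JSON matching the structure below.

Document A:
{document_a}

Document B:
{document_b}

Response structure:
{{"keywords": ["..."], "comparison": [{{"item": "...", "score": 0}}]}}
"""

prompt = PROMPT_TEMPLATE.format(document_a="<doc A text>", document_b="<doc B text>")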

The issues I'm facing:

  1. Gemini completely ignores boundaries that I pass to it. Example: extract no more than 20 keywords from document A; Gemini responds with 150 keywords drawn from both document A and document B.
  2. Highly inconsistent analysis. Example: follow A, B, and C criteria to perform some analysis and respond with a list of items that includes an independent numeric metric assigned to each item, following D, E, and F protocol. Gemini responds with highly variable metrics on each API call I make. Sometimes I get independent ratings, sometimes a stack ranking, sometimes what appear to be arbitrarily decrementing values, etc.
  3. Gemini won't return more than 4 or 5 consecutive words from most documents I pass to it, due to recitation issues. I've given up on this altogether, as there's not much that can be done when the model itself is nerfed.

I think what is most frustrating about all of this is that responses were previously excellent and consistent within AI Studio, prior to the model switching from preview to production.

A couple of things:

1. How are your prompts structured? Are they system instructions or just fed in as standard user prompts? Are they just a few lines of what to do, or do you provide examples for the model to follow as well? I spend more time on the examples I provide as part of the instructions than on the instructions themselves, as I have found examples are far more effective. For example, if you are getting more than 20 keywords in the response when you shouldn't be, you could include an example that asks a similar question and, in its response, shows the reasoning Gemini should take to reach the answer, including rejecting candidate answers that exceed the limit in favor of a correct answer that stays under it (see the first sketch at the end of this reply). Examples are everything in prompt engineering. At least in my experience.

2. Agents would help here. You could create a validation agent whose entire purpose in life is to validate whether the first agent provided a correctly structured response or not, and feed it back to the first agent to be redone if it didn't. This is nothing more than asking one model, which has a defined set of instructions, a question, taking the response, and feeding it to another model with a completely separate set of instructions to check whether it broke any rules, returning a yes/no answer: if yes, it broke a rule, feed it back to the original agent to be redone; if no, pass the response on to the UI to be seen by the user (see the second sketch at the end of this reply). This is a highly simplified example, but you get the point.

I have a mix of agents for the simple reason that bogging down one agent with a huge set of instructions to do a bunch of things gets me back inconsistent results. So I maintain a master chat history between the user and the main agent, do all sorts of magic between the other agents, log all of that so I can inspect it if needed, but keep the chat history clean of it. The only things that get logged in the chat history are the original question and the final/correct response. I even have an agent responsible for formatting the display data back to the user: sometimes it might go in a table or in a chart/graph, and it has been taught how to structure a chart.js response, which I use JS to build dynamically so nice interactive charts pop up on the screen. Agents are everything when it comes to building apps with predictability.
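
First sketch (point 1): here is roughly what I mean by baking a worked example into the system instructions to enforce a limit. The wording and the example document below are made up for illustration; the point is that the model sees what a compliant answer looks like.

# Illustration of point 1: include a worked example in the system
# instructions so the model sees a compliant answer. All text here is
# made up for the sake of the example.
SYSTEM_INSTRUCTIONS = """
You extract keywords from documents.

Rules:
- Extract keywords ONLY from Document A, never from Document B.
- Return at most 20 keywords, as a JSON array.

Example:
Document A: "Cloud billing alerts notify project owners when spend exceeds
a configured budget threshold."
Correct response: {"keywords": ["cloud billing", "alerts", "budget threshold"]}
Incorrect response (rejected because it exceeds the limit or pulls terms
from Document B): {"keywords": ["... 150 items from both documents ..."]}
"""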
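
Second sketch (point 2): a bare-bones version of the generator/validator loop I'm describing. This is a sketch only: call_gemini() is a placeholder for whatever client or REST call you already use, and the instruction strings are illustrative.

# Bare-bones generator/validator loop (point 2). call_gemini() is a
# placeholder for whatever client or REST call you already use; only the
# control flow matters here.
MAX_RETRIES = 3

GENERATOR_INSTRUCTIONS = "Answer the user and return JSON matching the schema."
VALIDATOR_INSTRUCTIONS = (
    "You check another model's output. Reply with exactly 'YES' if it is "
    "valid JSON, follows the schema, and has at most 20 keywords. Otherwise "
    "reply 'NO: <reason>'."
)

def call_gemini(system_instructions: str, user_text: str) -> str:
    """Placeholder: send one request with these system instructions."""
    raise NotImplementedError

def answer_with_validation(question: str) -> str:
    draft = call_gemini(GENERATOR_INSTRUCTIONS, question)
    for _ in range(MAX_RETRIES):
        verdict = call_gemini(VALIDATOR_INSTRUCTIONS, draft)
        if verdict.strip().upper().startswith("YES"):
            return draft  # only this final answer goes into the main chat history
        # Feed the complaint back to the generator and try again.
        draft = call_gemini(
            GENERATOR_INSTRUCTIONS,
            f"{question}\n\nYour previous answer was rejected: {verdict}\nFix it.",
        )
    return draft  # give up after MAX_RETRIES and return the last attempt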

@jamiabailey this is really helpful!  Thank you so much for providing such a detailed response 🙂 

I'll update the thread with any solutions once I get back to working on this over the next week or so!

Hey. Have you got any updates?

Hey, sorry I forgot to update this! Moving some details to system_instructions did help with accuracy, but I was never able to get consistent results without breaking my prompts down into 2-3 smaller prompts and chaining them together. I never tried using agents, because doing so would basically have eliminated the cost advantages I was getting by running on the Flash 1.5 model, but I do believe going that route is still a viable option in certain cases... just not mine. 🙂 I'd also previously tried most everything listed in @shaikhsharmeen4's response below, so while that doesn't help me directly, I think it's an absolutely fantastic checklist to work through if you haven't already.

I do still use it in my app for at least one other, less complex workflow, and I'm getting good results there. But, ultimately I decided Gemini wasn't cut out for the wider use case that I was hoping for.

It's also worth noting I'm definitely not deeply experienced in this area, so I concede the issue may be on my end. Just decided it wasn't worth the continued trouble in this instance.

Thanks for the reply. I wanted to ask, where do I find the system_instructions option?

--
Evgeniy Mokich

Super simple. I'm using an HTTPS request, but the same idea applies in Python, JS, etc. If you wind up figuring out any tricks to get stuff working consistently, lmk!

 

{
  "contents": {
    "role": "user",
    "parts": {
      "text": <prompt>
    }
  },
  "system_instruction": {
    "parts": [
      {
        "text": <instructions>
      }
    ]
  },
  "generation_config": {
    "temperature": <temp>,
    "max_output_tokens": 8192,
    "response_mime_type": "application/json"
  }
}

Thanks. Unfortunately the system_instructions option has been absolutely useless to me. In fact, when I inserted my instructions there, the responses only got worse.

Your issue with Gemini 1.5 Flash responses being inconsistent between the API and AI Studio is likely due to subtle differences in environment configurations, parameter handling, or token limits that aren’t immediately obvious.

Here are a few things you can try:

  1. Match API Configurations Exactly:

    • Ensure that all API parameters such as temperature, top_p, top_k, and max_output_tokens are identical between AI Studio and your API calls.
    • Pay attention to the structure of your API request; some fields (like formatting or token limits) may not match up perfectly between environments.
  2. Check Token Limit & Output Truncation:

    • Your prompt is large (~8k tokens), and Gemini Flash's response can be limited by the max_output_tokens parameter. You may want to ensure it’s not cutting off valuable content by checking both input and output token counts (see the countTokens sketch below).
    • If your input size is close to the model’s token limit, try reducing it slightly to allow more headroom for processing the full response.
  3. Content Formatting:

    • Gemini models can be sensitive to the structure and format of your prompts. Ensure that your JSON structure for Gemini is correctly formatted in both the API and AI Studio.
    • You can test by simplifying your prompt slightly to see if the issue persists when asking for more basic results.
  4. Try Without response_mime_type:

    • Although you're using response_mime_type: "application/json", check if removing this or using text/plain improves the response quality. Sometimes, forcing a specific MIME type may alter how the model processes the request.
  5. Monitor Response Payload:

    • Capture and review the full response payload in the API to see if there’s any difference in structure or format compared to AI Studio. The degradation in quality could be related to the API response handling or truncation during the POST request.
  6. Debug API Logs:

    • Use Google Cloud's logging services to monitor the API request and response. This will help you identify any discrepancies in the actual data sent versus what's expected.
  7. Request More Fine-Grained Control:

    • In some cases, the underlying infrastructure might be handling requests differently. If AI Studio uses more advanced internal parameters (not exposed through the API), you could contact Google Cloud support to check for further guidance or hidden parameters.

Try these steps, and see if they narrow down the inconsistency.
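
For point 2 specifically, the countTokens endpoint reports exactly how many input tokens a request consumes, so you can confirm how close you are to the limits. A minimal sketch follows; the API key handling, model name, and prompt placeholder are illustrative.

import requests

# Minimal sketch: ask countTokens how many input tokens the request uses.
# The API key and prompt placeholder are illustrative.
API_KEY = "YOUR_API_KEY"
URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-1.5-flash-latest:countTokens"
)

body = {"contents": [{"role": "user", "parts": [{"text": "<your full prompt>"}]}]}
resp = requests.post(URL, params={"key": API_KEY}, json=body, timeout=60)
resp.raise_for_status()
print(resp.json())  # e.g. {"totalTokens": 8123}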

I have this same issue. The model just doesn't follow my system prompt the way it should!

I'm testing Gemini 1.5 Flash through the API on a RAG application with a simple system prompt and a ~15-token user prompt at 0.9 temperature. The answers are so short they seem almost rude, sometimes only a single word.

Giving it an example of the required style in the user prompt seems to work for one interaction, but on the next turn, even when provided with memory, it returns to the previous behavior.

I was also considering switching from Claude 3.5 Haiku to Flash but this way it's impossible.

Did any of you find a solution to this?

Yes, I am also very interested. 

I see big differences between the output in Google AI Studio and the output from the API, even though the prompt, the configuration of the model, and even the code are identical.

As shaikhsharmeen4 mentioned, AI Studio might use more advanced internal parameters (not exposed through the API).

If anyone has ideas on how to solve this, please let us know.


Same here. About one third of my results from API calls are different. In my case, I compare API calls with Vertex AI in the Google Cloud console. Results in the console are more in line with my system instructions.

After looking into this further, I still don’t have a solution, but I wanted to share my findings in case they help someone get closer to one.

In my case, the results from Google AI Studio and the API (whether using Vertex AI API or Google Generative AI API) are exactly the same. What I found interesting is that Google AI Studio seems to have a fixed Top-K of 40, while I set mine to 1—yet the results remain identical. This might be because I'm using a temperature of 0.

On the other hand, in Vertex AI on Google Cloud, Top-K is fixed at 1 (matching my API settings), but about one third of the results are different, which is an issue because that one third is more accurate than what I get back from the API. I initially thought there might be an issue with my API call, where despite setting Top-K to 1 the model could be using a different value. However, after testing with various Top-K values, I noticed only slight differences, none of which pertained to the one third of results that differ from Vertex AI on Google Cloud.
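
For anyone trying to quantify the same thing, a crude repeatability check is to send the identical request N times and count the distinct outputs. A sketch follows, where send_request() is just a placeholder for whatever API call you already make with temperature 0 and a fixed Top-K.

from collections import Counter

N_RUNS = 20

def send_request(prompt: str) -> str:
    """Placeholder for your existing generateContent call (temperature 0, fixed Top-K)."""
    raise NotImplementedError

def repeatability(prompt: str) -> Counter:
    # Send the identical prompt N times and count how many distinct responses come back.
    outputs = [send_request(prompt) for _ in range(N_RUNS)]
    counts = Counter(outputs)
    print(f"{len(counts)} distinct responses out of {N_RUNS} runs")
    return counts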


Glad to see you're getting solid results! Gemini is extraordinarily cost-effective, which has forced me in a way to continue to experiment. I've definitely seen things improving in terms of consistency lately, albeit perhaps not to the extent you're describing. I plan to start testing the 2.0 Flash API over the next 1-2 weeks on one of my products and hopefully will see things continue to improve.

--

Dan Schlung | Founder



Hi @dan7 One of the things that worked for me was keeping the text prompt first and then putting all the file links after it, for the OCR use case.
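
In terms of the request body, that just means putting the text part before the file parts in the parts array. A rough sketch; the file URI below is a placeholder for whatever the Files API returned when the document was uploaded.

# Sketch of "text prompt first, files after": ordering of the parts array.
# The file_uri is a placeholder for a file previously uploaded via the Files API.
payload = {
    "contents": [{
        "role": "user",
        "parts": [
            {"text": "Extract the invoice number and total from the attached PDF."},
            {"file_data": {
                "mime_type": "application/pdf",
                "file_uri": "https://generativelanguage.googleapis.com/v1beta/files/<file-id>",
            }},
        ],
    }]
}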