I am using a fine-tuned gemini-1.5-flash model to process large amounts of data (worth noting the same issue happens with the base model alone). I have increased my quota for "Generate content requests per minute per project per base model per minute per region per base_model" for the relevant base model and region.
Strangely, my usage now no longer appears on the "Quotas" page in Google Cloud, and my calls to the generate_content endpoint still seem restricted to the base rate limit of 5/min/region even after the quota increase.
The specific error I am getting now is "429 Online prediction request quota exceeded for gemini-1.5-flash. Please try again later with backoff." This is different from the errors I got before the quota increase, which referenced the specific quota I was limited by.
I would use the batch processing functionality, but it appears fine-tuned Gemini models do not support batch processing.
In short: I increased quotas to avoid rate-limit errors, which resulted in unclear rate-limit errors and the Quotas page breaking.
Hi @camelmnts,
Welcome to Google Cloud Community!
According to this Nov 8, 2024 release note:
Batch prediction is available for Gemini in General Availability (GA). Available Gemini models include Gemini 1.0 Pro, Gemini 1.5 Pro, and Gemini 1.5 Flash. To get started with batch prediction, see Get batch predictions for Gemini.
Alternatively, you can also refer to this documentation for more information on batch processing and Gemini 1.5 Flash limitations.
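For reference, when batch prediction is supported for a model, the input is a JSONL file in which each line wraps one GenerateContent-style request. Here is a minimal sketch of building such a file in plain Python (the field names follow the Gemini batch prediction docs at the time of writing; verify them against the current documentation before relying on them):

```python
import json

prompts = ["Summarize record 1", "Summarize record 2"]

# Each JSONL line wraps one GenerateContent-style request.
lines = [
    json.dumps({
        "request": {
            "contents": [
                {"role": "user", "parts": [{"text": p}]}
            ]
        }
    })
    for p in prompts
]

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")
```

You would then upload the JSONL file to Cloud Storage (or BigQuery) and point the batch prediction job at it.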
With regard to error code 429: if the number of your requests exceeds the capacity allocated to process requests, error code 429 is returned. You can check this page for guidance on how to rectify the issue.
Another key thing to remember:
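Until the quota issue itself is resolved, client-side exponential backoff is the standard mitigation for 429s. Here is a minimal sketch in plain Python; `RateLimitError` is a stand-in for whatever exception your client raises on a 429 (for example, `google.api_core.exceptions.ResourceExhausted` when using the Vertex AI SDK):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the API's 429 error (e.g. ResourceExhausted in the Vertex AI SDK)."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=32.0):
    """Retry fn() on rate-limit errors, doubling the delay each attempt, with jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429 to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay))  # full jitter
```

You would wrap your call as, for example, `call_with_backoff(lambda: model.generate_content(prompt))`. Note that backoff only smooths over transient throttling; it cannot raise your effective throughput above the enforced limit.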
Quota is enforced on the number of concurrent tuning jobs. Every project comes with a default quota to run at least one tuning job. This is a global quota, shared across all available regions and supported models. If you want to run more jobs concurrently, you need to request additional quota for Global concurrent tuning jobs.
Furthermore, here is a workaround that may help resolve your concern: Google Cloud Support can investigate your account's quota settings and usage patterns to identify the root cause. Provide them with your region, the base model, the exact quota name, and the error messages you are seeing.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Hey Ruthseki,
I appreciate your detailed response. Based on the release notes you provided, I am going to assume that fine-tuned Gemini models are not supported for batch processing. Furthermore, when going to Console > Vertex AI > Batch Prediction > Create > Version Select (on my fine-tuned gemini-1.5-flash model), I get the error "Model is not supported for batch prediction." This is unfortunate because batch processing would resolve the rate-limit issue entirely, but I will monitor the changelog. Thank you for the link!
Quota debugging
Have you seen this behavior of a quota suddenly not reporting anymore? I am definitely above the default 4 RPM limit per region per model, but definitely not at the full 100 RPM of my quota. I don't see any other quotas across my entire project being consumed besides the one mentioned ("Online prediction requests per minute per region").
Any advice or further debugging tips are appreciated!
Hi @camelmnts,
Thank you for providing the image and the detailed breakdown of your testing and findings.
You need to raise the "Model is not supported for batch prediction" issue with Google Cloud Support, or you can file a defect report. Provide them with the details, including the release notes, the precise steps to reproduce the error, and your screenshots, so they can see exactly what you are seeing.
You also need to report that the "Generate content requests per minute per project per base model per minute per region per base_model" quota is not updating properly: it shows 0% despite usage and does not match your actual usage. Make sure to mention your region and base_model, and that you have already increased the quota to 100 but the problem persists.
I hope the above information is helpful.
Published the defect report here: https://issuetracker.google.com/issues/383821626
I am unfortunately unable to create an Organization on my account, so I can't get support. I am also unable to transfer my fine-tuned model to my company Google account, where I could set up an organization and get support.
Any ideas for another way to get support from GCP?
Hi @camelmnts,
In your case, I suggest keeping an eye on the defect report you published for updates. I can see it has already been assigned to our Engineering Team. Note that there's no definite date for when this will be addressed.
I hope the above information is helpful.
I am having the same problem with 429 errors. I do have an organization and support, but the solution they provided was to apply for more quota, and the docs say the maximum is 1, which we are already at. I am only sending text prompts; there should be no reason the system is overloaded. This happens in AI Studio and inside Genkit, and it is making it impossible to use the technology for building an application. They are now saying the only time available to discuss this issue is Christmas Eve. Google, please help!
Hi @camelmnts did you ever resolve this issue? What was the resolution?
No fix; see the issue tracker linked above.
From the documentation, it seems that the gemini-1.5-pro-002 and gemini-1.5-flash-002 models use Dynamic Shared Quota (DSQ), meaning standard quota increases don't apply. To guarantee higher throughput, Google requires Provisioned Throughput, which is in Preview and requires requesting access.