
Pricing - Vertex hosted tuned model

Brieuc
New Member

Hello,

I am trying to figure out the cost of using a fine-tuned version of Gemini-flash-2.0. I believe all the information is contained here: https://cloud.google.com/vertex-ai/pricing

I understand that there are training costs, and then also inference costs.

It says: "Prediction pricing for tuned model endpoints are the same as for the base foundation model."
However, does this mean that having an endpoint deployed for my tuned model costs me for each hour that the endpoint is up and running, even with no queries? Or is it only paid per token?


The following quote is from the "Pricing for AutoML models" section: "You pay for each model deployed to an endpoint, even if no prediction is made. You must undeploy your model to stop incurring further charges. Models that are not deployed or have failed to deploy are not charged." I think this could also be the case for Generative AI endpoints.

So my question is: is the cost for inference ONLY per token, or do you also pay by the hour for a deployed endpoint?

(Side note: the former could make sense, since you could take a server running the normal untuned Gemini Flash and swap the LoRA weights into VRAM for inference.)

 


Hello @Brieuc,

For fine-tuned Gemini models on Vertex AI, you incur both endpoint deployment costs and per-token inference charges. As with AutoML models, keeping an endpoint deployed (even idle) incurs hourly infrastructure fees until you undeploy it, separate from the usage-based token costs. The pricing page confirms that tuned models follow base-model rates for inference tokens; only the endpoint hosting remains billable on top of that.

Key points to keep in mind:

  • Hourly endpoint costs apply while deployed (like AutoML) 
  • Per-token pricing matches base Gemini Flash rates 
  • Undeploy to stop infrastructure charges 
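The two billing components above can be sketched numerically. Note that the rates below are made-up placeholders, NOT real Vertex AI prices; check https://cloud.google.com/vertex-ai/pricing for current figures.

```python
# Hypothetical illustration of the two billing components: usage-based token
# charges plus time-based endpoint hosting charges. All rates are placeholders.
TOKEN_RATE_PER_1K = 0.0001   # placeholder: cost per 1,000 inference tokens
ENDPOINT_HOURLY = 0.50       # placeholder: cost per hour the endpoint is deployed

def total_cost(tokens: int, hours_deployed: float) -> float:
    """Total = per-token inference charges + hourly endpoint hosting charges."""
    token_cost = tokens / 1000 * TOKEN_RATE_PER_1K
    hosting_cost = hours_deployed * ENDPOINT_HOURLY
    return token_cost + hosting_cost

# Under this model, an idle endpoint still accrues hosting charges:
idle = total_cost(tokens=0, hours_deployed=24 * 30)          # 720 h, no queries
busy = total_cost(tokens=2_000_000, hours_deployed=24 * 30)  # 720 h, 2M tokens
print(f"idle: ${idle:.2f}, busy: ${busy:.2f}")
```

This is why undeploying matters: the hosting term dominates whenever the endpoint sits mostly idle.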

While LoRA weight swaps could in theory enable token-only billing, Google currently charges for endpoint uptime regardless. Always undeploy unused models to keep costs down.

Best regards,

Suwarna