
Can we please get an offline token-counter so RAG chunkers can work reliably w/ embedding-exp-03-07

I'm quite proud of my "chunker" for my custom RAG, which has some elegant recursive mechanisms based on tiktoken:

`tokenizer = tiktoken.encoding_for_model("text-embedding-3-large")`
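For context, a minimal sketch of that kind of recursive, token-budgeted splitter (the budget and the split boundaries here are illustrative, not my production values):

```python
import tiktoken

tokenizer = tiktoken.encoding_for_model("text-embedding-3-large")

def chunk(text: str, max_tokens: int = 512) -> list[str]:
    """Recursively split text until every piece fits the token budget."""
    if len(tokenizer.encode(text)) <= max_tokens:
        return [text]
    # Prefer coarse boundaries (paragraphs) before finer ones (sentences, words).
    for sep in ("\n\n", "\n", ". ", " "):
        parts = text.split(sep)
        if len(parts) > 1:
            mid = len(parts) // 2
            left, right = sep.join(parts[:mid]), sep.join(parts[mid:])
            if left and right:
                return chunk(left, max_tokens) + chunk(right, max_tokens)
    # No usable boundary left: hard-split on raw tokens as a last resort.
    ids = tokenizer.encode(text)
    return [tokenizer.decode(ids[:max_tokens])] + chunk(
        tokenizer.decode(ids[max_tokens:]), max_tokens
    )
```

Every `tokenizer.encode` call there is local and effectively instant, which is exactly what's missing for the Gemini embedding models.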

I want to upgrade to gemini-embedding-exp-03-07, but there's no way to count tokens without the ~1,000,000x slowdown of online API calls.

Is there an official local library, guaranteed to count tokens correctly, that we can use without exhausting our API quota or slowing down all our code unnecessarily?

I'm asking not just for myself, but for everyone who seriously wants to work with these models: "hello world" examples where token counts are basically ignored are nice and all, but a production environment depends on robust tooling actually existing. Not having (or releasing) those tools makes all these Gemini-* models a non-starter for serious business use cases.

Specifically, the `location=location` requirement of the library's `count_tokens` method needs to go (or the entire `vertexai.init(project=project_id, location=location)` call):

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part


def count_tokens(project_id: str, location: str, model_name: str, prompt: str) -> int:
    """Counts the number of tokens in the given text via an online API call."""
    vertexai.init(project=project_id, location=location)
    model = GenerativeModel(model_name)
    response = model.count_tokens(Part.from_text(prompt))
    return response.total_tokens
```
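For reference, a call looks like this (the project ID, region, and model name below are placeholders, not values from this thread):

```python
total = count_tokens(
    project_id="my-project",      # placeholder project
    location="us-central1",       # the region requirement objected to above
    model_name="gemini-1.5-pro",  # placeholder model name
    prompt="hello world",
)
print(total)
```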

 

Solved
1 ACCEPTED SOLUTION

Here is (or should be) the answer: [screenshot: pic_2025-03-28_08.48.05_1500.png]


3 REPLIES

Hi @cndg,

Welcome to Google Cloud Community!

There's currently no offline token counting method for Gemini embeddings like gemini-embedding-exp-03-07, making precise chunking for RAG applications difficult.

Here's what you can do:

  • You can use the Count Tokens API, which gives an accurate count of token usage before sending requests to the Gemini API, especially when dealing with mixed-media inputs. You may refer to this discussion as reference (see the sketch after this list).
  • As a less ideal workaround, use larger chunk sizes than preferred and handle potential truncation. This might work in some cases but requires careful testing.
  • Regularly check Vertex AI Embeddings documentation for updates.
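If you do stay on the API route for now, a simple cache keeps repeated chunks from costing extra calls. This is a rough sketch of that idea, not something from the docs; project, region, and model name are placeholders:

```python
from functools import lru_cache

import vertexai
from vertexai.generative_models import GenerativeModel

# Initialize once, not per call, to avoid repeated setup overhead.
vertexai.init(project="my-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-pro")  # placeholder model name

@lru_cache(maxsize=100_000)
def cached_token_count(text: str) -> int:
    # Each unique string costs one API round-trip; repeats are served from memory.
    return model.count_tokens(text).total_tokens
```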

Alternatively, you can submit a feature request so that our Engineering Team can help you further. Please note that I cannot specify when this enhancement will be implemented. For future updates, I recommend monitoring the tracker and release notes regularly.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Looks like a new offline feature to do this has just been released:

https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/list-token
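That page covers the local tokenizer shipped in the Vertex AI SDK's preview namespace, which counts tokens without any network call. A minimal sketch of its use (the model name is an example from the docs; whether gemini-embedding-exp-03-07 is supported is worth verifying):

```python
from vertexai.preview import tokenization

# Loads the model's tokenizer locally; counting is fully offline afterwards.
tokenizer = tokenization.get_tokenizer_for_model("gemini-1.5-flash-001")

result = tokenizer.count_tokens("The quick brown fox jumps over the lazy dog.")
print(result.total_tokens)  # no API call, no quota consumed
```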

Here is (or should be) the answer: [screenshot: pic_2025-03-28_08.48.05_1500.png]