
Google ADK and large datasets

I have been experimenting with the Google Agent Development Kit and I absolutely love it. One thing I've been struggling with is how to operate on data that is too large to fit into a prompt. For example, I have an agent that uses a tool to get the schema of the large dataset and then builds code to process the dataset. I would like to be able to have the large dataset loaded into the execution environment without it being passed around in a prompt. How do I accomplish this? As a workaround, I have an agent that updates any python code generated to load the source data from a bucket, but this seems hacky. Ideally I could pass a reference to the bucket object in the session data and have it loaded into memory inside an execution environment.

Hello @jodison,

To avoid loading large datasets into prompts, store references (e.g., GCS URIs or BigQuery table names) in the agent’s session state instead of the raw data. Lazy-load the data in the execution environment, fetching it only when a tool actually needs it, to keep memory overhead low. Tools should access the data through these references, using caching or chunked processing for scalability. This approach keeps prompts lean, fits ADK’s architecture, and supports datasets of any size. For tighter integration, you can extend ADK’s Session class so storage and retrieval happen automatically.
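For example, a small tool can record just the object's URI in session state when the dataset is registered — a minimal sketch, assuming ADK's dict-like ToolContext.state; the state key and function name here are placeholders:

```python
from google.adk.tools import ToolContext


def register_dataset(gcs_uri: str, tool_context: ToolContext) -> dict:
    """Keeps only a reference to the dataset in session state, never the data itself."""
    # Hypothetical state key; any name your tools agree on works.
    tool_context.state["dataset_uri"] = gcs_uri
    return {"status": "registered", "dataset_uri": gcs_uri}
```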

Some key steps you could try (sketched in code after this list):

  • Store references in session data
  • Lazy-load in execution environments
  • Process data via tools without prompt pollution
  • Optimize with caching/chunking for performance
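Putting those steps together, the processing tool resolves the reference and loads data only at call time, returning a compact summary rather than the raw rows. This is a sketch, assuming pandas with gcsfs installed so gs:// paths can be read directly; summarize_dataset and the state key are hypothetical names:

```python
import pandas as pd

from google.adk.tools import ToolContext


def summarize_dataset(tool_context: ToolContext) -> dict:
    """Lazy-loads the dataset referenced in session state and returns a small summary."""
    uri = tool_context.state.get("dataset_uri")
    if not uri:
        return {"error": "No dataset registered in session state."}
    # pandas reads gs:// URIs via gcsfs; for very large files, read in chunks
    # or select only the needed columns instead of loading everything.
    df = pd.read_csv(uri)
    return {
        "rows": len(df),
        "columns": list(df.columns),
        "preview": df.head(5).to_dict(orient="records"),
    }
```

Only the summary dict ever goes back to the model, so the prompt stays small regardless of how large the underlying file is.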

Some alternatives I would try:

  • For tabular data, use BigQuery federated queries (direct SQL in tools).
  • For files, stream via google-cloud-storage client.
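Both alternatives keep the heavy data out of the prompt. A minimal sketch of each, where the SQL, bucket, and object names are placeholders:

```python
from google.cloud import bigquery, storage


def query_table(sql: str) -> list[dict]:
    """Runs SQL directly in BigQuery and returns only the (small) result rows."""
    client = bigquery.Client()
    return [dict(row) for row in client.query(sql).result()]


def stream_blob_lines(bucket_name: str, blob_name: str):
    """Streams a GCS object line by line instead of loading it into memory."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    # blob.open() returns a file-like object that reads from GCS in chunks.
    with blob.open("r") as f:
        for line in f:
            yield line
```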

I hope the above answer helped!

Best regards,

Suwarna