Solved: Document AI data storage

mb2025 · 02-12-2025 10:21 PM

Hello,

I want to process DOCX/PDF files stored in GCP storage with the Document AI full OCR processor and, at run time, check whether certain text exists in each document, then show the results to the user on the frontend. Do I need to set up a separate BigQuery/vector DB for the interim embeddings generated by the processor, or will it automatically store them in GCP buckets? Based on the total number of tokens in the document, I also want to show the user the cost of analysing the document. (Note: I do not wish to provide any search functionality to the user, so I am not sure whether I still need a vector DB.)

Any help on this is greatly appreciated.

Thank you.

cassandramae

Hi @mb2025,

Welcome to Google Cloud Community!

Document AI processes documents and returns the result as structured JSON. The OCR processor will not automatically store embeddings in BigQuery, a vector database, or your GCS bucket; it simply returns a structured JSON response containing the processed document information.

For example, code running in Cloud Functions can examine the JSON response from Document AI and search it for specific text strings. If the document meets your criteria, you can extract the relevant information from the JSON and format it into a user-friendly structure for the frontend. Also, the JSON response from Document AI does not directly provide a token count, so you will need to calculate it yourself.
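As a rough illustration of the steps above, here is a minimal Python sketch. It assumes the Document AI response has already been converted to a plain dict whose top-level "text" field holds the full OCR text and whose "pages" field is a list of pages. The function name analyse_document, the whitespace-based token count, and the ASSUMED_COST_PER_1K_TOKENS rate are all hypothetical placeholders for illustration, not part of Document AI; please substitute your own tokenizer and your real pricing.

```python
import re

# Hypothetical rate, for illustration only; replace with your real pricing.
ASSUMED_COST_PER_1K_TOKENS = 0.01


def analyse_document(doc, required_phrases,
                     cost_per_1k_tokens=ASSUMED_COST_PER_1K_TOKENS):
    """Check which phrases occur in the OCR text and estimate a cost.

    `doc` is assumed to be the Document AI JSON response parsed into a dict
    (for example with json.load), with the full text under the "text" key.
    """
    text = doc.get("text", "")
    lowered = text.lower()

    # Case-insensitive substring check for each phrase of interest.
    matches = {phrase: phrase.lower() in lowered for phrase in required_phrases}

    # Assumption: a "token" is a whitespace-separated chunk. A model-specific
    # tokenizer would give different numbers.
    token_count = len(re.findall(r"\S+", text))

    return {
        "matches": matches,
        "page_count": len(doc.get("pages", [])),
        "token_count": token_count,
        "estimated_cost": round(token_count / 1000 * cost_per_1k_tokens, 6),
    }


# Tiny hand-written stand-in for a parsed Document AI response.
sample = {
    "text": "Invoice 42\nTotal due: 100 USD",
    "pages": [{"pageNumber": 1}],
}
result = analyse_document(sample, ["total due", "refund"])
```

The returned dict can be serialized as-is and sent to the frontend; nothing here needs to be persisted in BigQuery or a vector database.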

In addition, BigQuery and vector databases are not required, since you mentioned that you do not need to provide search functionality or large-scale analytics. For more information, you may check this documentation.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
