
Implementing Real-Time Invoice Data Extraction using Dataflow, Doc AI, and BigQuery

Hi Google Cloud Community,

I'm working on a project where we need to build a pipeline to extract data from invoice PDFs and store it in BigQuery. Below is the flow we have in mind. I'd like feedback on how to implement this in real time, and on whether Dataflow is the best choice for this use case.

Flow Overview:

  1. GCS Bucket: This is where invoice PDFs (in .pdf format) will be uploaded.

  2. Dataflow: A streaming pipeline that triggers as soon as a new file lands in the GCS bucket. Dataflow will call Doc AI to extract the relevant fields from each invoice and load them into BigQuery according to the data model shared in Section 5 of the TDD.

  3. BigQuery: This will be the destination where the structured invoice data will be stored.
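One common way to get the real-time trigger in step 2 is to have the bucket publish object-finalize events straight to a Pub/Sub topic that the streaming pipeline subscribes to. A minimal sketch, assuming placeholder names for the topic and bucket:

```shell
# Create a topic and wire the bucket's upload events to it
# (topic and bucket names below are examples, not real resources).
gcloud pubsub topics create invoice-events
gsutil notification create -t invoice-events -f json \
    -e OBJECT_FINALIZE gs://your-invoice-bucket
```

Each upload then produces a JSON Pub/Sub message containing the bucket and object name, which the pipeline can turn into a `gs://` URI for Doc AI.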

Key Questions:

  1. Real-Time Processing:

    • How can we make this pipeline process invoices in real time, as soon as they are uploaded to GCS? Are there best practices for achieving low-latency, real-time data processing in this flow?

  2. Doc AI Integration with Dataflow:

    • How can Doc AI be integrated with a Dataflow pipeline? Are there any resources or best practices for connecting Doc AI with Dataflow for seamless data extraction?

    • Does Doc AI support batch processing in Dataflow, and if so, how does that work for real-time processing?

  3. Parallel Processing:

    • Given the potentially large number of invoices, how will parallel processing work in Dataflow when extracting data from multiple invoices concurrently? Will Dataflow handle scaling automatically, or will we need to configure parallelism manually?

  4. Is Dataflow the Right Tool?:

    • Is Dataflow the best choice for this use case, or would another service (e.g., Cloud Functions, Cloud Run, or Cloud Pub/Sub) be a better fit for real-time PDF processing and data extraction?

    • What are the pros and cons of using Dataflow in this scenario?

  5. Other Considerations:

    • Any other considerations or best practices when building this pipeline on Google Cloud?

    • It would also be helpful if you could point me to any documentation for reference.

Looking forward to your insights and recommendations!

Thanks

1 ACCEPTED SOLUTION

Hi @harshada2828 ,
Your design looks very solid! For real-time processing, I recommend triggering a Cloud Function when a file is uploaded to GCS, then pushing an event to Pub/Sub (GCS can also publish upload notifications to Pub/Sub directly). From there, Dataflow in streaming mode can subscribe to Pub/Sub, call Doc AI for extraction, and load the results into BigQuery.
Dataflow autoscales workers automatically, though you can tune the maximum worker count if you need more parallelism. Doc AI's synchronous (online) processing works well for small files, which suits a real-time flow; larger documents require asynchronous batch processing instead.
You're on the right track: combining Pub/Sub + Dataflow + Doc AI is a strong, scalable solution!
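To make the per-message logic concrete, here is a sketch of the two pure transforms the streaming pipeline would run between Pub/Sub and BigQuery. The Beam and Doc AI wiring is shown only in comments so the snippet stays runnable without GCP credentials; the entity field names are assumptions based on the Doc AI invoice parser, not taken from your TDD.

```python
import json

def gcs_uri_from_notification(message_data: bytes) -> str:
    """Turn a GCS upload notification payload (JSON from Pub/Sub)
    into a gs:// URI that Doc AI can read."""
    attrs = json.loads(message_data)
    return f"gs://{attrs['bucket']}/{attrs['name']}"

def entities_to_row(entities) -> dict:
    """Flatten Doc AI-style entities (type / mentionText pairs) into a
    flat dict keyed by entity type, ready for a BigQuery insert."""
    row = {}
    for e in entities:
        # Doc AI returns one entity per detected field, e.g.
        # {"type": "invoice_id", "mentionText": "INV-001"}.
        row[e["type"]] = e["mentionText"]
    return row

# Conceptual wiring inside the streaming pipeline (assumes
# apache-beam[gcp]; subscription and table names are placeholders):
#   beam.io.ReadFromPubSub(subscription=...)   # -> message bytes
#   | beam.Map(gcs_uri_from_notification)      # -> gs:// URIs
#   | beam.ParDo(ProcessWithDocAI())           # -> entity dicts per invoice
#   | beam.Map(entities_to_row)                # -> BigQuery rows
#   | beam.io.WriteToBigQuery(table=..., ...)
```

The `ProcessWithDocAI` DoFn is where you would call the Doc AI client's `process_document` (synchronous mode) for each URI; keeping the parsing helpers pure like this makes them easy to unit-test outside Dataflow.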


