How to process large PDF files (more than 15 pages) in Document AI?

Hello All,

I am creating a process to extract data from PDFs using Document AI, but the challenge is that all of the PDFs are longer than 15 pages. How can I split these PDFs and process them to extract the data with Document AI?

Hi @ankitdwivedi,

Handling large PDF files in Document AI requires a strategy that works within the service's per-request limits. Document AI caps the number of pages and the file size it accepts in a single processing request. The 15-page limit you mention applies to online (synchronous) requests; batch (asynchronous) requests accept longer documents, with the exact limit depending on the processor. Here’s how to address these challenges:

Use a PDF Splitting Tool: There are several tools available that can help you split large PDFs into smaller files. For example, you can use Google Cloud's Document AI Toolbox to split PDFs based on output from a Splitter/Classifier processor.
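If you would rather split by a fixed page count than by processor output, here is a minimal sketch. It is not the Document AI Toolbox approach mentioned above: it assumes the third-party pypdf package is installed, and the function names, the output file naming, and the default of 15 pages per part are my own choices for illustration.

```python
from pathlib import Path


def page_ranges(total_pages: int, max_pages: int = 15) -> list[tuple[int, int]]:
    """Return half-open (start, end) page index ranges of at most max_pages each."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages must be >= 1")
    return [
        (start, min(start + max_pages, total_pages))
        for start in range(0, total_pages, max_pages)
    ]


def split_pdf(src: str, out_dir: str, max_pages: int = 15) -> list[Path]:
    """Write src as a series of smaller PDFs and return their paths in order."""
    # Assumption: pypdf is available (pip install pypdf). It is imported here
    # so that page_ranges() stays usable without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for start, end in page_ranges(len(reader.pages), max_pages):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # Zero-padded, 1-based page numbers keep the parts sortable by name.
        part = out / f"{Path(src).stem}_p{start + 1:04d}-{end:04d}.pdf"
        with open(part, "wb") as handle:
            writer.write(handle)
        paths.append(part)
    return paths
```

Keeping the starting page of each part (it is encoded in the file name here) makes it much easier to map results back to the original document later.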

Batch Processing: After splitting, process the smaller PDFs in batches. This improves efficiency and resilience. Don't send all the smaller PDFs in a single API call; instead, send them in manageable groups. The optimal batch size depends on your project's Document AI quotas and on how the processor performs with your documents, so experiment to find the sweet spot.
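The grouping itself needs nothing Document AI specific. A small sketch using only the standard library; the helper name and any batch size you pass in are placeholders, not recommended values:

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

You would then loop over `batched(part_paths, batch_size)` and submit one request per group, tuning `batch_size` against your quotas.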

Asynchronous Processing: For large numbers of smaller PDFs, consider asynchronous processing. Submit each PDF's processing request to a queue (e.g., a task queue such as Google Cloud Tasks) and handle the results as they complete. This prevents blocking your main application while waiting for each individual response.
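To illustrate the non-blocking pattern without setting up a queue, the sketch below swaps in a local thread pool from the Python standard library instead of Cloud Tasks. `process_one` is a hypothetical callable that you would write to wrap your own Document AI request; nothing here calls the service.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Iterable


def process_all(
    paths: Iterable[str],
    process_one: Callable[[str], Any],
    max_workers: int = 4,
) -> tuple[dict[str, Any], dict[str, Exception]]:
    """Run process_one on every path concurrently.

    Returns (results, errors), both keyed by path, so one failing part
    does not discard the results of the others.
    """
    results: dict[str, Any] = {}
    errors: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        # as_completed yields each future as soon as it finishes,
        # regardless of submission order.
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                errors[path] = exc
    return results, errors
```

Keep `max_workers` low enough that the concurrent requests stay within your request quota.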

Error Handling and Retries: Implement robust error handling. Network issues or Document AI service limitations can cause failures. Include retry mechanisms with exponential backoff to handle transient errors.
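A minimal sketch of retries with exponential backoff and jitter, standard library only. `TransientError` is a hypothetical exception class: in real code you would catch whichever exceptions your client raises for timeouts, rate limiting, or temporary unavailability. The default delays are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(
    func: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # The delay ceiling doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait in [0, ceiling] spreads out retries
            # from many workers so they do not all hit the service at once.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Only retry errors that are actually transient. A malformed PDF or an invalid processor name will fail the same way every time, so those should be reported rather than retried.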

Data Aggregation: After processing all the chunks, you need to aggregate the results. This involves combining the extracted data from each smaller PDF into a unified representation of the original large document's information.
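One detail that is easy to miss when aggregating: page numbers in each result are relative to the smaller PDF, not to the original. The sketch below shifts them back. The chunk shape (a dict with `start_page` and a list of `entities` carrying a 0-based `page`) is a simplified, hypothetical structure for illustration and is not the Document AI `Document` object, so you would first need to extract these fields from your actual responses.

```python
from typing import Any


def merge_chunks(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Combine per-chunk entities into one list with original page numbers.

    Each chunk is assumed to look like:
        {"start_page": 15, "entities": [{"type": "...", "text": "...", "page": 0}]}
    where start_page is the 0-based index of the chunk's first page in the
    original PDF and page is 0-based within the chunk.
    """
    merged: list[dict[str, Any]] = []
    # Sort by start_page so the output follows the original document order
    # even when results arrive out of order.
    for chunk in sorted(chunks, key=lambda c: c["start_page"]):
        for entity in chunk["entities"]:
            # Copy the entity so the caller's input is left unmodified.
            merged.append({**entity, "page": entity["page"] + chunk["start_page"]})
    return merged
```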

I hope the above information is helpful.