I have PDF documents containing invoices. A single PDF can contain multiple invoices, and each invoice can span one or more pages. I want to split/group the pages by invoice using the page number printed on each page (e.g. "page 1 of 2", "page 2 of 2"), which could be used for annotation/training of a model.
Example:
Input: A 5-page PDF. Page 1 contains invoice 1, page 2 contains invoice 2, and pages 3-5 contain invoice 3.
Output: 3 separate PDFs, one per invoice.
(Note: I cannot separate them by invoice number because it is missing from some invoices.)
sample pdf
Welcome to Google Cloud Community!
Document AI can automate a variety of document processing tasks, including splitting a large document into multiple documents.
For your specific need, you can use the Custom Splitter.
The Custom Splitter is designed to split composite documents (documents made up of multiple classes) into single-class documents by identifying each logical document, such as an invoice. You define the set of classes yourself, then train and evaluate the splitter on your own data so that it is tailored to your documents.
You can check the quickstart guide on how to use Document AI to create and train a custom splitter that splits and classifies your invoice documents. You can also follow this step by step guide for this task directly in the Google Cloud console.
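As a rough illustration of the page-number idea from the question (for example, to produce page ranges when labeling training data), here is a minimal Python sketch. It assumes you already have the text of each page from an OCR step, and that a page marked "page 1 of N" starts a new invoice. The function name `group_pages` and the input list are hypothetical and not part of Document AI; pages with no readable marker are simply kept with the current invoice.

```python
import re
from typing import List, Tuple

# Matches markers such as "Page 1 of 2" or "page 2 of 2" (case-insensitive).
PAGE_MARKER = re.compile(r"page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)


def group_pages(page_texts: List[str]) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) page ranges, one per invoice.

    Assumption: a page whose marker reads "page 1 of N" begins a new invoice.
    Pages without a recognizable marker stay with the current invoice.
    """
    ranges: List[Tuple[int, int]] = []
    start = None
    for index, text in enumerate(page_texts, start=1):
        match = PAGE_MARKER.search(text)
        current = int(match.group(1)) if match else None
        if start is None:
            # The first page of the PDF always opens the first invoice.
            start = index
        elif current == 1:
            # A new "page 1 of N" closes the previous invoice.
            ranges.append((start, index - 1))
            start = index
    if start is not None:
        ranges.append((start, len(page_texts)))
    return ranges


# Hypothetical OCR text for the 5-page example in the question.
example_pages = [
    "Invoice ... Page 1 of 1",
    "Invoice ... Page 1 of 1",
    "Invoice ... page 1 of 3",
    "... page 2 of 3",
    "... page 3 of 3",
]
print(group_pages(example_pages))
```

For the 5-page example this yields the ranges (1, 1), (2, 2) and (3, 5). A rule like this only works when the page markers are extracted reliably, which is one reason a trained splitter may be the more robust option.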
I hope the above information is helpful.