
Document AI Custom Splitter Use Case

Hi all, and thanks in advance for any guidance. I'm wondering if the Document AI Custom Splitter will be able to fulfill my requirements, and I'm hoping someone with experience using it will know the answer.

I have a large-scale scanning operation. We will scan a box of paper, roughly 2,000 pages, or 4,000 images, in a batch, and we would like to send those images to the Custom Splitter to be separated into individual documents. Typically, there will be only one document type in the entire batch, and the length of each document will vary, from as little as one image to as many as several hundred images. On average, however, each document will be around 25 images. The goal is for the Custom Splitter to return data in JSON format that will identify that the first document begins on page 1, the second document begins on page 8, the third document begins on page 23, and so on throughout the ~2,000 pages of the file sent for processing.
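To make the goal concrete, here is a sketch of how I plan to consume the splitter's output, assuming the standard Document JSON shape in which each detected sub-document is an entity whose pageAnchor lists the 0-based pages it spans (the field names follow the Document AI output format; the sample payload itself is made up):

```python
import json

def document_start_pages(document_json: str) -> list[int]:
    """Given a splitter output Document (as a JSON string), return the
    1-based start page of each detected logical document.

    Assumes each entity carries a pageAnchor whose pageRefs list the
    0-based page indices that entity spans.
    """
    doc = json.loads(document_json)
    starts = []
    for entity in doc.get("entities", []):
        refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        if not refs:
            continue
        # pageRefs "page" values are 0-based; a value of 0 may be omitted
        # entirely in the JSON, so default missing pages to 0.
        first = min(int(ref.get("page", 0)) for ref in refs)
        starts.append(first + 1)
    return sorted(starts)

# Toy payload shaped like splitter output for three logical documents:
sample = json.dumps({
    "entities": [
        {"type": "type_a", "pageAnchor": {"pageRefs": [{"page": "0"}, {"page": "6"}]}},
        {"type": "type_a", "pageAnchor": {"pageRefs": [{"page": "7"}]}},
        {"type": "type_a", "pageAnchor": {"pageRefs": [{"page": "22"}]}},
    ]
})
print(document_start_pages(sample))  # [1, 8, 23]
```

If that mapping from entities to start pages holds at the scale described above, the rest of my pipeline is straightforward.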

Based on my testing with the Custom Splitter so far, I have the following questions:

  1. The data model requires a minimum of two values in the schema. How do I handle having only a single document type, i.e. a single schema value?
  2. The splitter is not designed to handle logical documents of over 30 pages. I don't know whether this means that only a file of 30 pages or fewer can be sent to the Custom Splitter for processing, or that each logical document within the file must be 30 pages or fewer. Assuming it's the latter, what happens when a document exceeds 30 pages?
  3. If the Custom Splitter will not meet my requirements, is there another Document AI tool that will?

Again, thank you for any help you can provide.


Document AI Custom Splitter is a powerful tool for document processing, but it does have certain limitations. Let's address your specific questions:

Handling Single Document Type:

The schema does require at least two values, but you can satisfy that with a single real document type plus a generic catch-all. For example, define a "Document Type" field with your actual type (say, "Type A") and a second placeholder value such as "Other". Even if every document in the batch is labeled "Type A", the second value simply fulfills the schema requirement.

Handling Documents Over 30 Pages:

The 30-page limitation refers to individual logical documents, not the whole file. If a single document within your batch runs longer than 30 pages, the Custom Splitter may not separate it accurately, so each logical document to be split needs to be 30 pages or fewer. For documents longer than that, you may need to pre-process them by splitting them into smaller chunks before feeding them to the Custom Splitter.
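As a rough illustration (a minimal sketch, not tied to any particular PDF library), pre-splitting just means carving a long document's page count into fixed-size windows and exporting each window as its own file:

```python
def page_windows(total_pages: int, max_pages: int = 30) -> list[tuple[int, int]]:
    """Split a page count into inclusive, 1-based (start, end) windows of
    at most max_pages pages each, e.g. for sharding an over-long logical
    document before sending it to a processor with a per-document limit.
    """
    windows = []
    start = 1
    while start <= total_pages:
        end = min(start + max_pages - 1, total_pages)
        windows.append((start, end))
        start = end + 1
    return windows

# A 70-page document becomes three sub-documents:
print(page_windows(70))  # [(1, 30), (31, 60), (61, 70)]
```

Note that a blind fixed-size split like this can cut across real document boundaries, so in practice you'd apply it only to documents you already know exceed the limit.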

Alternative Document AI Tools:

If Custom Splitter's limitations pose challenges for your specific use case, you can consider other Google Cloud Document AI services:
a. Enterprise Document OCR: this processor extracts text from documents but doesn't split them into logical documents.
b. Form Parser: if your documents have a structured, form-like layout, this processor can extract key-value pairs and tables from them.
c. Custom processors (Document AI Workbench): Workbench lets you build custom classifiers and extractors in addition to the Custom Splitter, and may be a better fit if you need more control over document processing.

In summary, while the Custom Splitter is a valuable tool for document separation, it has limits on document length and schema requirements. If those limits don't align with your needs, one of the other Document AI processors above may provide a more suitable solution for your project.