Document AI Custom Splitter Use Case

Hi all, and thanks in advance for any guidance. I'm wondering whether the Document AI Custom Splitter can meet my requirements, and I'm hoping someone with experience using it will know the answer.

I have a large-scale scanning operation. We will scan a box of paper, roughly 2,000 pages (about 4,000 images), in a single batch, and we would like to send those images to the Custom Splitter to be separated into individual documents. Typically there will be only one document type in the entire batch, and the length of each document will vary, from as few as one image to several hundred, averaging around 25 images per document. The goal is for the Custom Splitter to return JSON that identifies where each document begins, e.g. the first document begins on page 1, the second on page 8, the third on page 23, and so on throughout the ~2,000 pages of the file sent for processing.
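To make the desired output concrete, here is a rough sketch of how I imagine consuming the result. It assumes the splitter writes standard Document AI Document JSON, where each detected sub-document appears as an entity whose pageAnchor lists the pages it spans; the file name and the exact field handling are my assumptions, so corrections are welcome:

```python
import json

# Hypothetical path to one shard of the splitter's batch output;
# batch processing writes one or more Document JSON files to GCS.
OUTPUT_JSON = "splitter_output-0.json"

with open(OUTPUT_JSON) as f:
    document = json.load(f)

# Assumption: each detected sub-document is an entity whose
# pageAnchor lists the (zero-based) pages it spans.
for i, entity in enumerate(document.get("entities", []), start=1):
    page_refs = entity.get("pageAnchor", {}).get("pageRefs", [])
    # "page" can be omitted in the JSON encoding when it is 0.
    pages = [int(ref.get("page", 0)) for ref in page_refs]
    if not pages:
        continue
    start, end = min(pages) + 1, max(pages) + 1  # 1-based for humans
    print(f"Document {i}: type={entity.get('type')}, pages {start}-{end}")
```

If the Custom Splitter's output is structured differently, a pointer to the actual response schema would help just as much.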

Based on my testing with the Custom Splitter so far, I have the following questions:

  1. The data model requires a minimum of two values in the schema. How do I handle having only a single document type, i.e., a single schema value?
  2. The splitter is documented as not being designed to handle logical documents of more than 30 pages. I'm not sure whether this means that only a file of 30 pages or fewer can be sent to the Custom Splitter for processing, or that each document within the file must be 30 pages or fewer. Assuming it's the latter, what happens when a document runs longer than 30 pages?
  3. If the Custom Splitter will not meet my requirements, is there another Document AI tool that will?

Again, thank you for any help you can provide.
