Re: Custom extractor character box detection

tomislavcosic · 10-30-2024 04:58 AM

Hello!

I'm using a Document AI Custom Extractor (custom model, not Generative AI) to extract data from scanned paper forms. The forms contain fields laid out in character boxes, and extraction quality seems to suffer because of it. The lines between characters appear to cause problems, so characters such as |, 1, and I are sometimes detected incorrectly. The unnatural spacing between boxed characters also seems to make the model either extract only part of a field value or capture parts of it more than once (e.g. the value 123456 in character boxes is detected as something like 12323456).

I understand that Custom Extractor has no built-in support for character box detection, so our best effort so far has been to also run the documents through the Enterprise Document OCR Processor, which does support detection in character boxes. We then use custom code to locate the character-box field values in the Enterprise Document OCR Processor output and substitute them for the corresponding values in the Custom Extractor output. That helps but, as expected, it is janky and unreliable.
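
A minimal sketch of the kind of replacement step described above, assuming both processors' outputs have already been reduced to text plus normalized bounding boxes. The `Box` and `Token` types and the `text_in_region` helper are hypothetical names for illustration, not part of the Document AI client library, and the sketch assumes a single-line field read left to right:

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Normalized page coordinates in [0, 1] (assumed representation).
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class Token:
    # One OCR token (hypothetical type): its text and where it sits on the page.
    text: str
    box: Box


def overlap_ratio(token_box: Box, field_box: Box) -> float:
    """Fraction of the token's area that lies inside the field box."""
    w = min(token_box.x_max, field_box.x_max) - max(token_box.x_min, field_box.x_min)
    h = min(token_box.y_max, field_box.y_max) - max(token_box.y_min, field_box.y_min)
    if w <= 0 or h <= 0:
        return 0.0
    area = (token_box.x_max - token_box.x_min) * (token_box.y_max - token_box.y_min)
    return (w * h) / area if area > 0 else 0.0


def text_in_region(tokens: list[Token], field_box: Box, min_overlap: float = 0.5) -> str:
    """Join the OCR tokens that fall mostly inside the field box, left to right."""
    hits = [t for t in tokens if overlap_ratio(t.box, field_box) >= min_overlap]
    hits.sort(key=lambda t: t.box.x_min)
    return "".join(t.text for t in hits)
```

The idea is to take the field's bounding box from the Custom Extractor entity and the tokens from the OCR processor output for the same page, then use the joined OCR text in place of the entity's own text. The 0.5 overlap threshold is an arbitrary starting point and would need tuning on real forms.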

We need to read specific values from a custom paper form, so Custom Extractor seems to be the only model type which provides that option. At the same time, adjusting the design of paper forms to get rid of the character boxes is not an option. Is there a way to handle character boxes in Custom Extractor, or some other way to improve the final output quality?

I'll gladly answer any questions if I failed to provide some useful information.

Thank you!

MarvinLlamas

Hi @tomislavcosic,

Welcome to Google Cloud Community!

I understand that you are encountering issues when extracting data from scanned paper forms with character boxes using Document AI Custom Extractor.

Here are some possible ways that may help to improve the accuracy of your extracted data:

Human-AI collaboration model: You may add a manual review step so that the critical fields of these documents are checked by a person, which can improve the accuracy of the final data.
Training and data enrichment: You may also consider training the processor on an enriched dataset of your documents, as this could improve the accuracy of the extracted data.
Run Custom Extractor: By creating a custom extractor for your documents, you can train and evaluate the model on your own data, potentially improving its accuracy.
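
As a rough sketch of how a manual review step could be triggered, the check below flags a field for human review when its confidence is low or when its value does not match the format expected for that field. The `needs_review` helper, the 0.9 threshold, and the six-digit pattern are assumptions for illustration only, not Document AI features:

```python
import re


def needs_review(value: str, confidence: float, pattern: str,
                 min_confidence: float = 0.9) -> bool:
    """Return True if an extracted field should be checked by a person."""
    # Low model confidence is always worth a second look.
    if confidence < min_confidence:
        return True
    # A value that does not match the expected format (e.g. a repeated
    # fragment making it too long) is flagged even if confidence is high.
    return re.fullmatch(pattern, value) is None


# Example: a field that should contain exactly six digits.
print(needs_review("12323456", 0.97, r"\d{6}"))
print(needs_review("123456", 0.97, r"\d{6}"))
```

A format check like this only catches errors that break the expected pattern, so it complements rather than replaces confidence-based review.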

I hope the above information is helpful.