Hello!
I'm using a Document AI Custom Extractor (custom model, not Generative AI) to extract data from scanned paper forms. Some fields on these forms are laid out as character boxes, and extraction quality suffers for those fields. The lines between characters seem to cause problems: characters such as |, 1, and I are sometimes detected incorrectly. The unnatural spacing between boxed characters also seems to make the model either extract only part of the field value or capture parts of it more than once (e.g. the value 123456 in character boxes is detected as something like 12323456).
I understand that Custom Extractor has no built-in support for character boxes, so our best attempt so far has been to also run the documents through the Enterprise Document OCR processor, which does support text in character boxes. We then use custom code to locate the values of character-box fields in the Enterprise Document OCR output and substitute them for the corresponding values in the Custom Extractor output. That helps, but as expected it is janky and unreliable.
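To make the replacement step concrete, here is a minimal sketch of what such a merge could look like. It is not our production code. It assumes both processor outputs have already been reduced to plain Python objects with normalized bounding boxes, and that each character-box field sits on a single line. The names Box, Token, Entity, and replace_from_ocr are hypothetical and are not part of the Document AI client library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    # Normalized page coordinates in [0, 1] (assumed simplification of a bounding polygon).
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


@dataclass
class Token:
    # One piece of text from the OCR processor output (hypothetical shape).
    text: str
    box: Box
    page: int = 0


@dataclass
class Entity:
    # One field from the Custom Extractor output (hypothetical shape).
    field: str
    value: str
    box: Box
    page: int = 0


def replace_from_ocr(entity: Entity, tokens: List[Token], min_chars: int = 1) -> str:
    """Rebuild a field value from the OCR tokens whose centers lie inside the entity box.

    Falls back to the extractor's own value when fewer than min_chars
    characters are found. Assumes a single-line field read left to right.
    """
    inside = []
    for token in tokens:
        if token.page != entity.page:
            continue
        cx = (token.box.x_min + token.box.x_max) / 2
        cy = (token.box.y_min + token.box.y_max) / 2
        if entity.box.contains(cx, cy):
            inside.append((cx, token.text))
    inside.sort(key=lambda pair: pair[0])  # left-to-right reading order
    candidate = "".join(text.strip() for _, text in inside)
    return candidate if len(candidate) >= min_chars else entity.value


if __name__ == "__main__":
    # Six boxed digits, deliberately supplied out of order.
    digits = [Token(str(i + 1), Box(0.10 + 0.05 * i, 0.20, 0.14 + 0.05 * i, 0.24))
              for i in range(6)]
    field = Entity("account_number", "12323456", Box(0.09, 0.19, 0.40, 0.25))
    print(replace_from_ocr(field, list(reversed(digits))))
```

Matching on token centers rather than full containment tolerates small misalignment between the two processors' boxes, but it still breaks when the extractor's box covers only part of the field, which is one reason the approach feels unreliable.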
We need to read specific values from a custom paper form, and Custom Extractor seems to be the only model type that offers this. Redesigning the paper forms to remove the character boxes is not an option. Is there a way to handle character boxes in Custom Extractor, or some other way to improve the final output quality?
I'll gladly answer any questions if I failed to provide some useful information.
Thank you!