documentAI Labeling OCR issues: text select vs. bounding box

jhaus · 02-22-2024 11:08 AM

I need some help with best practices for labeling a PDF in DocAI. Based on the details below, which option is best, or is there a better option I haven't considered? The situation is as follows:

From Best Practices for Labeling:

Prefer labeling with the bounding box tool first. If that fails, use the select text tool. If the value of the label is not correctly detected by OCR, manually correct the value.

My forms frequently return two values when labeled with the bounding box tool. To resolve this I can do one of the following:

Manually correct the value (low difficulty).
Drag the bounding box in a very precise way to only capture one of the two values (moderate difficulty).
Switch to the select text tool to attempt to grab only the text correctly identified by OCR (high difficulty).

The example below (zoomed in on a small portion of the document) returns the following values:

"02 60" when labeling the entire area with the bounding box tool.
"02" when labeling a precise area with the bounding box tool.
"{A VERY LONG STRING OF CHARACTERS FROM ALL OVER THE DOCUMENT}" when labeling with the select text tool. This was essentially unusable; I believe it was because the OCR detected two overlapping values.

[Screenshot 1: entire area, bounding box tool]
[Screenshot 2: precise area, bounding box tool]
[Screenshot 3: select text tool]
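
To make the first two results concrete, here is a minimal sketch of how a drawn box might pick up OCR tokens. It is only an illustrative model, not Document AI's actual selection logic: the `Box` and `Token` classes, the `text_in_box` helper, the `min_overlap` threshold, and all coordinates are hypothetical. It assumes each OCR token has a normalized bounding box and that a token is captured when at least half of its area falls inside the label box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized page coordinates (hypothetical model)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)

    def intersection(self, other: "Box") -> float:
        # Overlap width and height; zero when the boxes do not overlap.
        w = min(self.x_max, other.x_max) - max(self.x_min, other.x_min)
        h = min(self.y_max, other.y_max) - max(self.y_min, other.y_min)
        return max(0.0, w) * max(0.0, h)


@dataclass(frozen=True)
class Token:
    text: str
    box: Box


def text_in_box(tokens, label_box, min_overlap=0.5):
    """Join the text of tokens whose area lies mostly inside label_box.

    min_overlap is an assumed threshold, not a documented Document AI value.
    """
    hits = [
        t for t in tokens
        if t.box.area() > 0
        and t.box.intersection(label_box) / t.box.area() >= min_overlap
    ]
    # Rough reading order: top to bottom, then left to right.
    hits.sort(key=lambda t: (t.box.y_min, t.box.x_min))
    return " ".join(t.text for t in hits)


# Two neighboring OCR tokens, with made-up coordinates.
tokens = [
    Token("02", Box(0.10, 0.20, 0.14, 0.24)),
    Token("60", Box(0.16, 0.20, 0.20, 0.24)),
]

wide = Box(0.09, 0.19, 0.21, 0.25)   # covers the entire area
tight = Box(0.09, 0.19, 0.15, 0.25)  # covers only the first value

print(text_in_box(tokens, wide))   # 02 60
print(text_in_box(tokens, tight))  # 02
```

Under this model, the wide box returns both values because both tokens sit inside it, while the tight box excludes the second token entirely. That matches the behavior in results 1 and 2 above, though the real tool may use a different rule.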
