documentAI Labeling OCR issues: text select vs. bounding box

jhaus · 02-22-2024 11:08 AM

I need some help with best practices for labeling a PDF in DocAI. Based on the details below, which option is best, or is there a better option I haven't considered? The situation is as follows:

From Best Practices for Labeling:

Prefer labeling with the bounding box tool first. If that fails, use the select text tool. If the value of the label is not correctly detected by OCR, manually correct the value.

My forms frequently return two values when labeled with the bounding box tool. To resolve this I can do one of the following:

Manually correct the value (low difficulty).
Drag the bounding box in a very precise way to only capture one of the two values (moderate difficulty).
Switch to the select text tool to attempt to grab only the text correctly identified by OCR (high difficulty).

The example below (zoomed in on a small portion of the document) returns the following values:

"02 60" when labeling the entire area with the bounding box tool.
"02" when labeling a precise area with the bounding box tool.
"{A VERY LONG STRING OF CHARACTERS FROM ALL OVER THE DOCUMENT}" when labeling with the select text tool. This was essentially unusable. I believe this happened because the OCR detected two overlapping values.

1. Entire area, bounding box tool
2. Precise area, bounding box tool
3. Select text tool
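
To make the difference between the two bounding box results concrete, here is a rough Python sketch. It assumes (this is not confirmed DocAI behavior) that a drawn box picks up every OCR token whose own box overlaps it. The `Box`, `Token`, and `text_in_region` names and all coordinates are hypothetical, chosen only to mimic two overlapping values like "02" and "60".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # Two axis-aligned boxes overlap when they intersect on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Token:
    text: str
    box: Box


def text_in_region(tokens, region):
    """Join the text of every token touching the region, left to right.

    Assumption: selection is by overlap. The real tool may use a
    different rule (for example, full containment).
    """
    hits = sorted((t for t in tokens if t.box.overlaps(region)),
                  key=lambda t: t.box.x0)
    return " ".join(t.text for t in hits)


# Hypothetical OCR output: two values whose boxes partly overlap.
TOKENS = [
    Token("02", Box(10, 10, 30, 22)),
    Token("60", Box(26, 12, 46, 24)),
]

whole_area = Box(8, 8, 48, 26)      # generous box around the whole field
precise_area = Box(10, 10, 25, 22)  # stops short of where "60" begins

print(text_in_region(TOKENS, whole_area))    # 02 60
print(text_in_region(TOKENS, precise_area))  # 02
```

Under that assumption, the precise box only works when it ends before the second token starts, which is why it takes careful dragging.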

Poala_Tenorio

Based on the situation you've described, it seems like manually correcting the value would be the most efficient option, as it has low difficulty and ensures accuracy. This method allows you to quickly correct any errors or inconsistencies in the OCR output without much hassle.

While dragging the bounding box in a very precise way to only capture one of the two values is a viable option, it may require more time and effort compared to manually correcting the value, especially if you have to do it repeatedly throughout the document.

Using the select text tool to grab only the text that OCR identified correctly may be challenging. As you mentioned, it returned an unusable string of characters, likely because of the overlapping values. This method is more difficult and may not give the desired result in this scenario.

Therefore, considering efficiency and effectiveness, manually correcting the value appears to be the best option for labeling your PDFs in DocAI in this particular situation.

jhaus

@Poala_Tenorio wrote:
manually correcting the value would be the most efficient option

Thank you for taking the time to share this answer; it's very helpful. You cite manual correction as the most efficient option. Would it also be the best option for improving accuracy, i.e. improving our F score?
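
For context on the metric in question: assuming "F score" here means the F1 score (the harmonic mean of precision and recall), it can be sketched as below. The `f1_score` helper and the counts are made up for illustration and are not taken from any real evaluation run.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 fields extracted correctly, 2 extracted wrongly,
# 2 missed entirely.
print(round(f1_score(true_pos=8, false_pos=2, false_neg=2), 2))  # 0.8
```

This only shows what the score measures. It does not by itself say which labeling method yields the higher score.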
