Solved: [Document AI] Difficulty with Checkbox Detection in Custom Template-Based Model

KT-K · 07-18-2024 01:44 AM

Hello,

I’m trying to extract text and checkboxes from a handwritten survey in PDF format. The survey consists of 4 series, each with 100 sets.

For the first series, the accuracy is around 80%, with only a few checkboxes incorrectly detected or missed.

However, for the remaining series, most checkboxes aren’t detected at all. When I label the checkboxes, they’re supposed to be highlighted in blue, but they aren’t.

Initially, I thought the document might not be clear enough, but upon comparing the good and bad checkboxes, they seem equally clear.

What should I do?

Thanks.

ruthseki

Hi @KT-K,

The lack of a gray label suggests that your template's checkbox definitions are not accurate, so the model can't align your annotations with the actual checkbox elements in the document.

Moreover, adding more samples without the gray label won't help the model understand what a checkbox is. It will simply learn to treat those regions as empty or undefined. The key is to ensure your existing samples have correctly labeled checkboxes with the gray label. This teaches the model what a checkbox looks like.
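As a rough way to find which samples need relabeling, something like the following could tally, per series, how many samples carry at least one checkbox label with a bounding box. This is only a sketch: the `samples` structure, the `series`, `labels`, and `box` keys, and the `checkbox` type name are made up for illustration and are not the Document AI export format, so you would need to adapt it to however your labels are stored.

```python
from collections import defaultdict


def audit_checkbox_labels(samples, checkbox_type="checkbox"):
    """Return {series: (samples_with_usable_label, total_samples)}.

    A sample counts as usable when at least one of its labels has the
    checkbox type and a non-empty bounding box.
    """
    usable = defaultdict(int)
    total = defaultdict(int)
    for sample in samples:
        series = sample["series"]
        total[series] += 1
        has_usable = any(
            label["type"] == checkbox_type and label.get("box") is not None
            for label in sample["labels"]
        )
        if has_usable:
            usable[series] += 1
    return {series: (usable[series], total[series]) for series in total}


# Hypothetical data: boxes are (x0, y0, x1, y1) in normalized page coordinates.
samples = [
    {"series": "1", "labels": [{"type": "checkbox", "box": (0.10, 0.20, 0.13, 0.23)}]},
    {"series": "2", "labels": [{"type": "checkbox", "box": None}]},
    {"series": "2", "labels": []},
]

report = audit_checkbox_labels(samples)
for series, (ok, n) in sorted(report.items()):
    print(f"series {series}: {ok}/{n} samples have a usable checkbox label")
```

A series where the first number is far below the second is a candidate for relabeling before any further training.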

Here are the factors influencing checkbox detection:

1. Template Accuracy:

Precise Bounding Boxes: Ensure the bounding boxes for your checkboxes are tight and accurate.
Correct Annotation: Use the correct annotation type (e.g., "CHECKBOX") in your template.
Consistency: Maintain consistent checkbox definitions across all series.

2. Training Data Quality:

Clarity and Consistency: Use clear, well-scanned documents with consistently marked checkboxes. Avoid blurry or smudged checkboxes.
Variety: Include a diverse set of checkboxes with different sizes, shapes, and even partially filled-in checkboxes.

3. Document Structure:

Spacing and Alignment: Ensure checkboxes are well-spaced and not too close to other text elements.
Text Proximity: If checkboxes are too close to text, the model might struggle to distinguish them.

4. Model Training:

Sufficient Data: Provide enough training data to cover the various checkbox styles and patterns in your surveys.
Training Duration: Train the model for a sufficient amount of time to allow it to learn effectively.
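Points 1 and 3 above can be checked roughly with a little geometry, assuming you can get your labeled boxes and the surrounding text boxes out as pixel coordinates. The sketch below flags labels that are loose compared with a hand-verified reference box (using intersection over union) and labels that sit very close to a text box. The names `labels`, `reference`, and `text_boxes`, and the thresholds `min_iou=0.7` and `min_gap=5`, are assumptions for illustration, not Document AI settings.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def gap(a, b):
    """Distance between two boxes; 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5


def check_labels(labels, reference, text_boxes, min_iou=0.7, min_gap=5):
    """Return (loose, crowded) lists of checkbox names.

    loose:   labels whose box overlaps the reference box too little.
    crowded: labels whose box is closer than min_gap to any text box.
    """
    loose = [
        name for name, box in labels.items()
        if iou(box, reference[name]) < min_iou
    ]
    crowded = [
        name for name, box in labels.items()
        if any(gap(box, text) < min_gap for text in text_boxes)
    ]
    return loose, crowded


# Hypothetical pixel coordinates for two checkboxes and two text lines.
reference = {"q1_yes": (100, 200, 120, 220), "q2_no": (100, 300, 120, 320)}
labels = {"q1_yes": (100, 200, 120, 220), "q2_no": (90, 290, 150, 340)}
text_boxes = [(122, 200, 300, 220), (160, 300, 300, 320)]

loose, crowded = check_labels(labels, reference, text_boxes)
print("loose labels:", loose)
print("crowded labels:", crowded)
```

Here `q2_no` is drawn much larger than its reference box, and `q1_yes` sits only 2 pixels from the text next to it, so each shows up in one of the two lists. Running the same check on a good series and a bad series is one way to see whether the label geometry differs between them.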

By focusing on template accuracy, data quality, and proper labeling, you should see a significant improvement in checkbox detection in your Document AI model.

I hope this clarifies your concern.

