Solved: [Document AI] Difficulty with Checkbox Detection in Custom Template-Based Model

KT-K · 07-18-2024 01:44 AM

Hello,

I’m trying to extract text and checkboxes from a handwritten survey in PDF format. The survey consists of 4 series, each with 100 sets.

For the first series, the accuracy is around 80%, with only a few checkboxes incorrectly detected or missed.

However, for the remaining series, most checkboxes aren’t detected at all. When I label the checkboxes, they’re supposed to be highlighted in blue, but they aren’t.

Initially, I thought the document might not be clear enough, but upon comparing the good and bad checkboxes, they seem equally clear.

What should I do?

Thanks.

ruthseki

Hi @KT-K,

Welcome to Google Cloud Community!

You can likely improve checkbox detection for your handwritten surveys by tuning your Document AI custom template-based model. Here are some strategies that could improve accuracy:

1. Leverage the Newer Foundation Model:

Document AI offers different foundation model versions for custom extractors with generative AI. Consider switching to the latest version, which includes advanced features like checkbox detection.

2. Refine Training Data:

Ensure your training data includes a good representation of checkboxes from all series, not just the first one. This helps the model generalize better.
If specific checkbox types are missed consistently, add more examples of those types to the training data.

3. Address Labeling Issues:

Double-check the labeling process for the problematic checkboxes. Make sure they are enclosed within a bounding box and assigned the correct label type (e.g., "Checkbox").
In rare cases, there might be rendering issues causing highlighting problems. Try re-importing the document or exporting and re-uploading it.
If the issue persists, it could be a platform bug; contact Google Cloud support.

4. Consider Alternative Approaches:

If the accuracy improvement is limited, explore alternative approaches:

Use a separate pre-trained model specifically designed for form processing, such as the Form Parser.
If the layout is consistent, explore template-based extraction without relying on checkbox detection. You can define areas on the form where a check mark might indicate a positive response.
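The fixed-area idea above can be sketched roughly as follows. This is only an illustration, not a Document AI feature: it assumes you have already rendered each page to a grayscale pixel grid yourself, and the names `dark_ratio` and `read_checkboxes`, the region coordinates, and both thresholds are hypothetical values you would need to calibrate against your own forms.

```python
from typing import Dict, List, Tuple

# (x_min, y_min, x_max, y_max) in normalized page coordinates (0.0 to 1.0).
Region = Tuple[float, float, float, float]


def dark_ratio(image: List[List[int]], region: Region, dark_below: int = 128) -> float:
    """Return the fraction of pixels inside `region` darker than `dark_below`.

    `image` is a grayscale page given as rows of 0-255 intensities (0 = black).
    """
    if not image or not image[0]:
        return 0.0
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = region
    # Convert normalized coordinates to pixel indices, clamped to the page.
    left, right = max(0, int(x0 * width)), min(width, int(x1 * width))
    top, bottom = max(0, int(y0 * height)), min(height, int(y1 * height))
    total = (right - left) * (bottom - top)
    if total <= 0:
        return 0.0
    dark = sum(
        1
        for row in image[top:bottom]
        for pixel in row[left:right]
        if pixel < dark_below
    )
    return dark / total


def read_checkboxes(
    image: List[List[int]],
    regions: Dict[str, Region],
    min_ratio: float = 0.15,  # assumed threshold, not a recommended value
) -> Dict[str, bool]:
    """Treat a region as checked when enough of it is covered by ink."""
    return {name: dark_ratio(image, box) >= min_ratio for name, box in regions.items()}
```

Because a printed, empty checkbox already contributes some border ink, `min_ratio` has to sit above the ratio you measure on blank forms. This approach also only works if every scan in a series is aligned the same way, so it is a fallback rather than a replacement for proper checkbox detection.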

By trying these steps, you should be able to improve the checkbox detection accuracy in your Document AI model for your handwritten surveys.

I hope the above information is helpful.

KT-K

Hello @ruthseki ,

Thank you for your reply.

Since the foundation model and the Form Parser are unable to detect my checkboxes, I am using the template-based model.

Moreover, I believe the primary reason is that most checkboxes do not display a grey label after I click "select Text" during labeling.

If I import more samples without grey labels, would that improve the results?

Additionally, what factors influence checkbox detection?

Thank you.

ruthseki

Hi @KT-K,

The lack of a gray label suggests that your template's checkbox definitions are likely not accurate. The model can't properly align your annotations with the actual checkbox elements in the document.

Moreover, adding more samples without the gray label won't help the model understand what a checkbox is. It will simply learn to treat those regions as empty or undefined. The key is to ensure your existing samples have correctly labeled checkboxes with the gray label. This teaches the model what a checkbox looks like.

Here are the factors influencing checkbox detection:

1. Template Accuracy:

Precise Bounding Boxes: Ensure the bounding boxes for your checkboxes are tight and accurate.
Correct Annotation: Use the correct annotation type (e.g., "CHECKBOX") in your template.
Consistency: Maintain consistent checkbox definitions across all series.

2. Training Data Quality:

Clarity and Consistency: Use clear, well-scanned documents with consistently marked checkboxes. Avoid blurry or smudged checkboxes.
Variety: Include a diverse set of checkboxes with different sizes, shapes, and even partially filled-in checkboxes.

3. Document Structure:

Spacing and Alignment: Ensure checkboxes are well-spaced and not too close to other text elements.
Text Proximity: If checkboxes are too close to text, the model might struggle to distinguish them.

4. Model Training:

Sufficient Data: Provide enough training data to cover the various checkbox styles and patterns in your surveys.
Training Duration: Train the model for a sufficient amount of time to allow it to learn effectively.
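One way to check the consistency and bounding-box points above is to audit your labeled documents per series before training. The sketch below assumes the labeled documents have been exported as Document JSON and loaded into Python dicts, with checkbox labels stored under `entities` and their boxes under `pageAnchor.pageRefs[].boundingPoly.normalizedVertices`. The function names, the set of checkbox label names, and the `max_area` threshold are hypothetical and depend on your own schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def box_area(vertices: List[dict]) -> float:
    """Area of the axis-aligned box spanned by normalized vertices.

    Missing "x"/"y" keys are read as 0.0, assuming zero values may be
    omitted from the JSON export.
    """
    xs = [v.get("x", 0.0) for v in vertices]
    ys = [v.get("y", 0.0) for v in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def audit_labels(
    docs: Iterable[Tuple[str, dict]],
    checkbox_types: Set[str],
    max_area: float = 0.01,  # assumed: a checkbox should cover under 1% of the page
) -> Dict[str, Dict[str, int]]:
    """Count checkbox labels and oversized boxes for each series.

    `docs` yields (series_name, document_dict) pairs.
    """
    report = defaultdict(lambda: {"documents": 0, "checkbox_labels": 0, "oversized": 0})
    for series, doc in docs:
        stats = report[series]
        stats["documents"] += 1
        for entity in doc.get("entities", []):
            if entity.get("type") not in checkbox_types:
                continue
            stats["checkbox_labels"] += 1
            for ref in entity.get("pageAnchor", {}).get("pageRefs", []):
                vertices = ref.get("boundingPoly", {}).get("normalizedVertices", [])
                if vertices and box_area(vertices) > max_area:
                    stats["oversized"] += 1
    return dict(report)
```

A series that shows many documents but few or no checkbox labels, or many oversized boxes, is a likely candidate for relabeling before you add more samples.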

By focusing on template accuracy, data quality, and proper labeling, you should be able to improve checkbox detection in your Document AI model.

I hope this clarifies your concern.

[Document AI] Difficulty with Checkbox Detection in Custom Template-Based Model