Solved: Re: [Document AI] Checkbox/Form Design

KT-K · 08-02-2024 02:14 AM

Hello,

I’m planning to extract text and checkboxes from a handwritten survey in PDF format. Previously, the training samples were ineffective because they didn’t display the grey label when I selected “Text” during labeling. I believe this was due to the original form not being recognized.

So, I’ve decided to design a new form/survey using MS Word. This time, the checkboxes are larger, but some of them still can’t be detected. I understand that blurring may occur after printing, filling in, scanning to PDF, and rendering in Document AI. That part is unavoidable and beyond my control.

Could you advise on the optimal size or conditions for the checkboxes to achieve better results? Is there anything else I can do?

FYI, I am using a Template-Based model.

dawnberdan

Hi @KT-K,

Welcome to Google Cloud Community!

The absence of a gray label indicates that the checkbox definitions in your template might be incorrect. As a result, the model struggles to align your annotations with the actual checkbox elements in the document.

Additionally, adding more samples that lack the gray label won’t improve the model’s understanding of what a checkbox is. Instead, the model will likely learn to see those areas as empty or undefined. The crucial step is to make sure your existing samples have correctly labeled checkboxes with the gray label. This helps the model learn what a checkbox looks like.

Moreover, increasing the checkbox size is a good move, but it's not a complete fix. Challenges like blurring, ink quality, and handwriting variability still impact detection.

To optimize checkbox design for accurate extraction with Document AI, consider the following:

Minimum Size: Use at least 12 mm x 12 mm (0.47 inches x 0.47 inches) to ensure the model can clearly identify the checkbox.
Larger Sizes: Increasing to around 15 mm x 15 mm (0.59 inches x 0.59 inches) can improve recognition, but avoid making checkboxes too large as it may affect the form layout.
Bold Outline: A bold outline of at least 1 pt (1/72 inch) enhances contrast and helps the model recognize the checkbox more easily.
Simple Shapes: Squares and circles are easiest for the model to detect.
Fill Color (Optional): A solid fill color, like black or gray, can improve visibility, but avoid colors too similar to the background.
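
As a quick check on the dimensions above, here is a minimal Python sketch that converts the suggested checkbox sizes into inches, points (the unit Word uses for shape sizes and line weights), and pixels. The 300 and 600 DPI values are assumed resolutions for illustration, and the function names are made up for this example; none of this is part of Document AI.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # 1 pt = 1/72 inch


def mm_to_inches(mm: float) -> float:
    """Convert millimetres to inches."""
    return mm / MM_PER_INCH


def mm_to_points(mm: float) -> float:
    """Convert millimetres to typographic points."""
    return mm_to_inches(mm) * POINTS_PER_INCH


def mm_to_pixels(mm: float, dpi: int) -> int:
    """Approximate pixel length of a physical size at a given resolution."""
    return round(mm_to_inches(mm) * dpi)


def points_to_pixels(pt: float, dpi: int) -> float:
    """Pixel thickness of a line weight given in points."""
    return pt / POINTS_PER_INCH * dpi


# Assumed resolutions; adjust to your own printer/scanner settings.
for side_mm in (12, 15):
    for dpi in (300, 600):
        print(
            f"{side_mm} mm = {mm_to_inches(side_mm):.2f} in = "
            f"{mm_to_points(side_mm):.1f} pt = "
            f"{mm_to_pixels(side_mm, dpi)} px at {dpi} DPI"
        )

print(f"1 pt outline = {points_to_pixels(1, 300):.1f} px at 300 DPI")
```

Under these assumptions a 12 mm box is roughly 34 pt in Word and about 142 pixels wide at 300 DPI, while a 1 pt outline is only about 4 pixels thick, so there is little margin once blurring is introduced.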

Also, make sure your Template-Based model is well aligned with the form's layout so that extraction is accurate. Experiment with different checkbox sizes and designs, and test various training datasets. Analyzing the results will help you identify the optimal combination for your specific form and use case.

I hope the above information is helpful.

KT-K

Hi @dawnberdan

Thank you for the information. It is quite difficult to set 12-15mm because the size differs between MS Word, the printed copy, and the scanned copy.😂

Anyway, as you said, “Challenges like blurring, ink quality, and handwriting variability still impact detection.”

Do you have any suggestions for other stages, like printout settings, scan settings, etc.?

dawnberdan

Hi @KT-K,

You're correct! Achieving consistent sizing across various formats can be challenging. To enhance OCR accuracy and minimize discrepancies, consider these tips for optimizing your print and scan settings:

Print Settings:

Resolution: Opt for a high resolution (at least 300 DPI, preferably 600 DPI) to ensure sharp, clear text that aids OCR software in distinguishing characters.
Font: Choose simple, readable fonts like Arial or Times New Roman. Avoid decorative or highly stylized fonts that might cause OCR errors.
Font Size: Use a reasonably large font size (12pt or larger) so that characters stay legible after printing and scanning.
Paper Type: Select high-quality paper that resists bleeding and distortion.
Printing Method: Laser printers typically offer cleaner and more consistent prints compared to inkjet printers.

Scan Settings:

Resolution: Maintain a high resolution (at least 300 DPI) for accurate character recognition.
Color Mode: For text-only documents, use black and white mode to avoid issues with OCR accuracy.
Brightness and Contrast: Adjust these settings to ensure the text is clear and contrasts well with the background.
Document Type: If your scanner has options for different document types (text, photo, etc.), choose the "text" setting for optimal OCR results.
Scan Mode: Use "duplex" mode if available to scan both sides of a document simultaneously, saving time and ensuring uniform scanning.
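
Since sizes can drift between Word, the printout, and the scan, one way to see what actually came out of the scanner is to work backwards from the scanned page's pixel dimensions. The sketch below is only an illustration: it assumes a full A4 page (210 mm wide) scanned edge to edge, the pixel measurements are made-up example values, and the function names are hypothetical.

```python
MM_PER_INCH = 25.4
A4_WIDTH_MM = 210.0  # assumption: full A4 page scanned edge to edge


def effective_dpi(page_width_px: float, page_width_mm: float = A4_WIDTH_MM) -> float:
    """Estimate the scan resolution from the page width in pixels."""
    return page_width_px / (page_width_mm / MM_PER_INCH)


def px_to_mm(length_px: float, dpi: float) -> float:
    """Convert a pixel length back to millimetres at the given resolution."""
    return length_px / dpi * MM_PER_INCH


def scale_vs_design(measured_mm: float, designed_mm: float) -> float:
    """Ratio of the measured size to the size set in the form design."""
    return measured_mm / designed_mm


# Illustrative values only: measure these in any image viewer.
page_width_px = 2480   # width of the scanned page image
box_width_px = 130     # width of one checkbox in the same image
designed_mm = 12.0     # size set for the checkbox in the form design

dpi = effective_dpi(page_width_px)
box_mm = px_to_mm(box_width_px, dpi)
print(f"effective resolution: {dpi:.0f} DPI")
print(f"checkbox on scan: {box_mm:.1f} mm "
      f"({scale_vs_design(box_mm, designed_mm):.0%} of designed size)")
```

A ratio noticeably below 100% may point to the print or scan step shrinking the page (for example a "fit to page" option), which is worth ruling out before redesigning the form.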

Additional Tips:

Minimize Background Noise: Ensure a clean background when scanning or photographing documents to reduce interference.
Straighten Documents: Use an automatic document feeder (ADF) or manually align documents to avoid skewed text.
Use OCR Software: Consider using specialized OCR software (like ABBYY FineReader, Adobe Acrobat Pro, or Google Drive’s OCR feature) for better results.
Experiment with Settings: Test different print and scan settings to find the best combination for your documents and OCR software.
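
To keep such experiments organized, it can help to list every combination of settings up front and test them one by one. The sketch below is just one way to do that: the setting values are taken from the tips above (600 DPI for scanning is an added assumption), and the names are hypothetical. It only enumerates the runs; the detection results have to come from your own tests.

```python
from itertools import product

# Candidate settings; extend or trim these to match your equipment.
PRINT_DPI = (300, 600)
PRINTERS = ("laser", "inkjet")
SCAN_DPI = (300, 600)  # 600 is an assumed extra value to compare against


def build_test_matrix():
    """Return one dict per print/scan combination to try."""
    return [
        {"print_dpi": p_dpi, "printer": printer, "scan_dpi": s_dpi}
        for p_dpi, printer, s_dpi in product(PRINT_DPI, PRINTERS, SCAN_DPI)
    ]


for run_id, settings in enumerate(build_test_matrix(), start=1):
    # Record how many checkboxes were detected for each run by hand.
    print(f"run {run_id}: {settings}")
```

Changing one setting at a time between runs makes it easier to tell which step (printing or scanning) is responsible for missed checkboxes.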

By fine-tuning your print and scan settings, you can greatly improve OCR accuracy and manage challenges related to document size and image quality.

I hope the above information is helpful.