Issue with Data Extraction Using Document AI: Confusing "0" (Zero) with "O" (Letter O)

andressasoares · 01-07-2025 04:36 AM

Hello, I’m reaching out because I’m experiencing an issue with data extraction from PDF files using Google Document AI. I’m working with a custom extractor, and I’ve noticed that when extracting fields that contain a mix of letters and numbers, the system often confuses the digit "0" (zero) with the uppercase letter "O."

This happens particularly in specific fields where such combinations are frequent, leading to errors in the extracted data. For example, a value like "A0B1" may be incorrectly extracted as "AOB1" or vice versa.

I’ve tried to adjust the custom model, but this issue persists. Has anyone faced a similar problem? Are there best practices, configurations, or post-processing techniques that could help resolve this?

Any advice or recommendations would be greatly appreciated!

dawnberdan

Hi @andressasoares,

Welcome to Google Cloud Community!

The confusion between "0" (zero) and "O" (uppercase O) is a common issue in optical character recognition (OCR), especially with handwritten or low-quality scanned documents. Document AI's custom extractors are capable, but they can still make this mistake. Here are some ways to reduce it:

Improve Input Document Quality:

Source Material: If possible, obtain higher-quality source PDFs. Crisp, clear scans or digitally created PDFs are less likely to produce OCR errors. Low resolution, blurry scans, or images with poor contrast significantly increase the chance of this type of error.
Preprocessing: Before feeding PDFs to Document AI, consider using image preprocessing techniques to enhance contrast and sharpness. There are various open-source libraries (like OpenCV) and online tools that can help improve the quality of the scanned images.

Fine-tune Your Custom Extractor Model:

Additional Training Data: A highly effective solution is to supply more training data to your custom extractor. Include examples that focus on distinguishing between "0" and "O" in contexts where confusion tends to occur. Make sure your training data covers a wide range of documents with different fonts, styles, and possible variations of the "0" and "O" characters.

Clear Labeling: Be thorough when labeling documents during the training process. Ensure that instances of "0" and "O" are clearly marked to help the model distinguish between them more effectively.
Step-by-Step Training: Train and test your model in stages. After each training session, review the results carefully, paying special attention to areas where the "0"/"O" confusion happens. Update your training data and adjust the settings based on the mistakes you find.
Custom Code: Write custom post-processing code that reviews the extracted data and applies logic to resolve the confusion. This could include comparing character shapes, looking at the context (such as nearby characters or words), or using a confidence score (if available from Document AI).
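
As a rough illustration of the post-processing idea, here is a minimal Python sketch. It assumes you know the expected layout of each field (for example, letter-digit-letter-digit for a value like "A0B1") and that you have already read each entity's text and confidence out of the Document AI response. The field name "part_code", the "LDLD" layout string, the helper names, and the 0.8 review threshold are all made up for this example, so adjust them to your own documents.

```python
# Hypothetical per-field layouts: "L" = letter expected, "D" = digit expected.
# "LDLD" matches the "A0B1" example from the question; real fields will differ.
FIELD_PATTERNS = {"part_code": "LDLD"}

AMBIGUOUS = "0Oo"


def fix_zero_o(value: str, pattern: str) -> str:
    """Swap 0/O wherever the character disagrees with the expected class."""
    if len(value) != len(pattern):
        # Layout does not match: leave the value untouched for manual review.
        return value
    fixed = []
    for ch, kind in zip(value, pattern):
        if kind == "D" and ch in "Oo":
            fixed.append("0")  # a digit is expected here, so "O" is likely zero
        elif kind == "L" and ch == "0":
            fixed.append("O")  # a letter is expected here, so "0" is likely "O"
        else:
            fixed.append(ch)
    return "".join(fixed)


def postprocess(field_name: str, value: str, confidence: float,
                review_below: float = 0.8):
    """Return (corrected value, needs_review flag) for one extracted field."""
    pattern = FIELD_PATTERNS.get(field_name)
    fixed = fix_zero_o(value, pattern) if pattern else value
    # Flag low-confidence values that still contain an ambiguous character,
    # so a person can check them instead of trusting the automatic fix.
    needs_review = confidence < review_below and any(c in AMBIGUOUS for c in fixed)
    return fixed, needs_review


if __name__ == "__main__":
    print(postprocess("part_code", "AOB1", 0.95))
    print(postprocess("part_code", "AOB1", 0.50))
```

This only works for fields with a fixed, known layout. For free-form fields there is no positional rule to apply, so routing low-confidence values to a manual review step is probably the safer option.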

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
