Hello, I’m reaching out because I’m experiencing an issue with data extraction from PDF files using Google Document AI. I’m working with a custom extractor, and I’ve noticed that when extracting fields that contain a mix of letters and numbers, the system often confuses the digit "0" (zero) with the uppercase letter "O."
This happens particularly in specific fields where such combinations are frequent, leading to errors in the extracted data. For example, a value like "A0B1" may be incorrectly extracted as "AOB1" or vice versa.
I’ve tried to adjust the custom model, but this issue persists. Has anyone faced a similar problem? Are there best practices, configurations, or post-processing techniques that could help resolve this?
Any advice or recommendations would be greatly appreciated!
Hi @andressasoares,
Welcome to Google Cloud Community!
The confusion between "0" (zero) and "O" (uppercase O) is a common issue in optical character recognition (OCR), especially with handwritten or low-quality scanned documents. Even though Document AI's custom extractors are strong, they can still get confused by this problem. Here are some ways to resolve it:
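One post-processing option, sketched here under the assumption that each affected field follows a known fixed format: describe the format as a mask and use it to decide, position by position, whether the character should be a letter or a digit. The mask alphabet (`A` for letter, `9` for digit) and the function names `normalize_by_mask` and `matches_mask` are hypothetical names chosen for this illustration; they are not part of Document AI. The input would be the text extracted for the field, for example an entity's `mention_text`.

```python
# Hypothetical mask alphabet (an assumption for this sketch, not a Document AI feature):
#   "A" = this position must be a letter
#   "9" = this position must be a digit
#   any other mask character = leave the extracted character untouched
TO_DIGIT = {"O": "0", "o": "0"}   # letter O read where a digit is expected
TO_LETTER = {"0": "O"}            # zero read where a letter is expected


def normalize_by_mask(value: str, mask: str) -> str:
    """Swap O/0 wherever the mask says the other character class is expected."""
    if len(value) != len(mask):
        # Length mismatch: the mask cannot be aligned, so change nothing.
        return value
    fixed = []
    for ch, kind in zip(value, mask):
        if kind == "9":
            fixed.append(TO_DIGIT.get(ch, ch))
        elif kind == "A":
            fixed.append(TO_LETTER.get(ch, ch))
        else:
            fixed.append(ch)
    return "".join(fixed)


def matches_mask(value: str, mask: str) -> bool:
    """Return True if every masked position holds the expected character class."""
    if len(value) != len(mask):
        return False
    for ch, kind in zip(value, mask):
        if kind == "9" and not ch.isdigit():
            return False
        if kind == "A" and not ch.isalpha():
            return False
    return True


# Example with the value from the question, assuming the field's format is
# letter-digit-letter-digit:
print(normalize_by_mask("AOB1", "A9A9"))  # → A0B1
```

Values that still fail `matches_mask` after normalization are good candidates for manual review. This only works for fields whose format is fixed; if a position can legitimately hold either a letter or a digit, a mask cannot tell "0" from "O" and the correction has to come from better scans or more training examples.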
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.