Solved: Low F1 Score with Document AI on Diverse Invoice and Expense Layouts: Seeking Improvement Tips

steven_tan10 · 11-03-2024 06:53 PM

Issue in Detail:

I'm experiencing difficulties in achieving high accuracy when using Google Document AI for parsing invoices and expenses. I am currently working with custom datasets that include approximately 1,500 images/documents per parser (Invoice Parser and Expense Parser). My goal is to reliably extract key data from these documents, but due to the diverse layouts and formats in my dataset, the F1 scores remain low, ranging between 0.75 and 0.85. This performance issue becomes even more apparent when testing with new images or documents; in these cases, predictions are frequently inaccurate or sometimes miss important fields.
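For context, F1 is the harmonic mean of precision and recall, so an overall score of 0.75 to 0.85 can hide a few fields that fail badly. The sketch below is only an illustration of computing F1 per field with exact-match comparison; the field names and values are made up for the example, and Document AI's own evaluation may match entities differently.

```python
from collections import Counter

def per_field_f1(expected, predicted):
    """Return {field: (precision, recall, f1)} for lists of (field, value) pairs."""
    exp_counts = Counter(expected)
    pred_counts = Counter(predicted)
    fields = {f for f, _ in expected} | {f for f, _ in predicted}
    scores = {}
    for field in sorted(fields):
        # A true positive is a predicted (field, value) pair that also appears in the labels.
        tp = sum(min(n, pred_counts[pair]) for pair, n in exp_counts.items() if pair[0] == field)
        n_pred = sum(n for pair, n in pred_counts.items() if pair[0] == field)
        n_exp = sum(n for pair, n in exp_counts.items() if pair[0] == field)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_exp if n_exp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[field] = (precision, recall, f1)
    return scores

# Illustrative data only: one correct field, one wrong value, one missed field.
expected = [("total_amount", "120.00"), ("invoice_date", "2024-11-03"), ("supplier_name", "Acme")]
predicted = [("total_amount", "120.00"), ("invoice_date", "2024-03-11")]
scores = per_field_f1(expected, predicted)
for field, (p, r, f1) in scores.items():
    print(f"{field}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Breaking the score down this way shows whether the low F1 comes from a handful of fields or from every field on certain layouts.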

What I've Already Tried:

I have invested significant time in annotating around 1,500 images/documents for each parser, following recommended steps to ensure consistent labeling.
I also followed the official annotation guidelines to improve data quality and maintain annotation standards.
The dataset I’m using contains a mix of scanned documents, soft copies, regular images, and photo images, aiming to improve model robustness. However, despite these efforts, the F1 scores are not improving beyond 0.75-0.85, and model predictions on new document layouts remain inconsistent.

I would appreciate any insights or recommendations for improving model accuracy, particularly with datasets that contain various layouts. Thank you!

dawnberdan

Hi @steven_tan10,

Welcome to Google Cloud Community!

It's great that you're actively working to improve your Document AI model's accuracy. Here are some strategies to address the challenges you're facing with diverse layouts and inconsistent predictions:

Data Augmentation and Preprocessing:

1. Layout Normalization: Consider using image processing techniques to normalize the layout of your documents. This could involve:

Skew Correction (Deskewing): Correcting any tilt or rotation in the images.
Perspective Correction: Adjusting the perspective of the image so the page appears rectangular.
Cropping: Removing unnecessary margins or whitespace.
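As a rough illustration of the cropping step, here is a minimal sketch in plain Python. It assumes the page is already loaded as a grayscale grid (a list of rows with values 0 to 255); the function name, threshold, and padding are hypothetical choices for the example. Skew and perspective correction usually need an image library and are not shown.

```python
def crop_margins(image, white_threshold=245, padding=1):
    """Crop near-white borders from a grayscale image given as a list of rows (0-255)."""
    # Rows and columns that contain at least one "ink" pixel darker than the threshold.
    rows = [y for y, row in enumerate(image) if any(p < white_threshold for p in row)]
    if not rows:
        return image  # Blank page: nothing to crop.
    cols = [x for x in range(len(image[0])) if any(row[x] < white_threshold for row in image)]
    # Keep a small padding around the content, clamped to the image bounds.
    top = max(rows[0] - padding, 0)
    bottom = min(rows[-1] + padding, len(image) - 1)
    left = max(cols[0] - padding, 0)
    right = min(cols[-1] + padding, len(image[0]) - 1)
    return [row[left:right + 1] for row in image[top:bottom + 1]]

# Tiny 6x6 "page" with two dark pixels in the middle.
page = [[255] * 6 for _ in range(6)]
page[2][2] = 0
page[3][3] = 0
cropped = crop_margins(page, padding=0)
```

If the documents are already annotated, remember that cropping shifts coordinates, so any bounding boxes would need the same offset applied.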

2. Data Augmentation: Generate synthetic variations of your existing documents to increase the diversity of your training data. This can include:

Rotation: Rotating images slightly.
Scaling: Changing the size of the images.
Noise Injection: Adding random noise to the images.
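The three augmentations above could be sketched as follows. This is a plain-Python illustration on a grayscale grid (list of rows, values 0 to 255), using nearest-neighbour sampling; the function names and default parameters are assumptions for the example, not part of Document AI. In practice an image library would be faster and give smoother results.

```python
import math
import random

def add_noise(image, amount=10, seed=0):
    """Add uniform random noise in [-amount, amount] to every pixel, clamped to 0-255."""
    rng = random.Random(seed)
    return [[min(255, max(0, p + rng.randint(-amount, amount))) for p in row] for row in image]

def scale_nearest(image, factor):
    """Resize by a factor using nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    new_h, new_w = max(1, round(h * factor)), max(1, round(w * factor))
    return [
        [image[min(h - 1, int(y / factor))][min(w - 1, int(x / factor))] for x in range(new_w)]
        for y in range(new_h)
    ]

def rotate_nearest(image, degrees, fill=255):
    """Rotate about the image centre; pixels that map outside the source are filled."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    cos_a, sin_a = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel for each output pixel.
            sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
            sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = image[iy][ix]
    return out

sample = [[0, 64, 128], [255, 200, 100]]
variants = [add_noise(sample, amount=5), scale_nearest(sample, 2), rotate_nearest(sample, 3)]
```

Note that rotation and scaling move the text on the page, so existing annotations would have to be transformed the same way before the augmented copies are used for training.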

3. OCR Preprocessing: Prior to inputting documents into Document AI, it's beneficial to use a reliable OCR engine to enhance text extraction accuracy. This step can help mitigate problems caused by blurry or low-quality images.
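One way to catch blurry inputs before they reach the parser is a sharpness check. The sketch below is not an OCR engine; it swaps in a common image-quality heuristic (variance of the Laplacian) to flag pages that may be too blurry. It assumes a grayscale grid as input, and any cut-off value you compare against is something you would need to calibrate on your own documents.

```python
def laplacian_variance(image):
    """Variance of a 4-neighbour Laplacian over interior pixels; low values suggest blur."""
    h, w = len(image), len(image[0])
    responses = [
        image[y - 1][x] + image[y + 1][x] + image[y][x - 1] + image[y][x + 1] - 4 * image[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    if not responses:
        return 0.0  # Image too small to have interior pixels.
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

# A sharp checkerboard has strong edges; a flat grey page has none.
sharp = [[0 if (x + y) % 2 else 255 for x in range(5)] for y in range(5)]
flat = [[128] * 5 for _ in range(5)]
print(laplacian_variance(sharp) > laplacian_variance(flat))
```

Pages that score far below the rest of the dataset could be rescanned or set aside rather than used for training.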

You can also refer to the following documentation for additional insights and guidance:

Expense Parser - This document contains detailed information on processors offered by Document AI.

Invoice Parser - This document contains detailed information on processors offered by Document AI.

Managing Processor Versions - Instructions on handling different processor versions for compatibility.

In addition, I came across an article/blog that covers Data Augmentation in Document AI which could be helpful for you.

I hope the above information is helpful.

steven_tan10

Apologies for the late response, and thank you @dawnberdan for your suggestion. If I understand your explanation correctly, we need to preprocess or clean the data before importing it into Document AI. Is that correct?

dawnberdan

Hi @steven_tan10,

Yes, performing data preprocessing or data cleaning before importing documents into Document AI is a key step in improving accuracy, especially when working with diverse layouts. By normalizing the layout, correcting distortions, and ensuring that the input data is clean, you give Document AI a better chance of making accurate predictions.

Once you have preprocessed your documents, you can then input them into Document AI for further analysis and extraction. This will likely lead to better results in terms of accuracy and consistency.
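A minimal sketch of that last step, assuming the `google-cloud-documentai` Python client is installed and credentials are configured. The project, location, processor ID, and file path are placeholders you would supply, and the 0.8 review threshold and helper names are hypothetical choices for illustration.

```python
def endpoint_for(location):
    """Regional API endpoint, e.g. 'us' -> 'us-documentai.googleapis.com'."""
    return f"{location}-documentai.googleapis.com"

def low_confidence(entities, threshold=0.8):
    """Return the (type, text, confidence) tuples that fall below the threshold."""
    return [e for e in entities if e[2] < threshold]

def process_file(project_id, location, processor_id, path, mime_type="image/png"):
    """Send one preprocessed file to a processor and return its extracted entities."""
    # Imported lazily so the helpers above work without the client library installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=endpoint_for(location))
    )
    name = client.processor_path(project_id, location, processor_id)
    with open(path, "rb") as f:
        raw_document = documentai.RawDocument(content=f.read(), mime_type=mime_type)
    result = client.process_document(
        request=documentai.ProcessRequest(name=name, raw_document=raw_document)
    )
    return [(e.type_, e.mention_text, e.confidence) for e in result.document.entities]
```

Running the same file through the processor before and after preprocessing, and comparing which entities `low_confidence` returns, is one way to check whether the cleanup actually helps on a given layout.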

I hope the above information is helpful.
