
Document AI Model Misclassifying Documents Despite Adequate Training Data

Hello,

I am experiencing issues while training and testing a custom model using Document AI. Despite having provided a sufficient amount of labeled training data, the model frequently misclassifies documents during the testing phase.

Questions:

  1. Are there any additional best practices I should follow to improve the model's performance?
  2. Could this be related to the Document AI's underlying architecture or configuration settings?
  3. How can I debug or fine-tune the model to enhance accuracy?

I’d appreciate any insights, suggestions, or resources to address this issue and improve the model's performance.

Thank you!


Hi @tootsieroll, 

Welcome to Google Cloud Community!

I understand that you are encountering issues while training and testing a custom model using Document AI. There are several reasons why a model might misclassify documents even when it has sufficient labeled training data.

Here are possible causes of misclassification and how to address them:

Data Quality - Make sure there are no labeling errors or inconsistencies. Keep labels consistent by following clear labeling guidelines and having multiple reviewers check the data.

Data Diversity - Include different types of documents and formats in your training data to match what the model will see in real use.

Model Configuration - Check whether certain document types or features are repeatedly misclassified (see the sketch after this list for one way to inspect per-document predictions and confidence scores).

Regular Retraining - Monitor the model's performance, add new training data as it becomes available, and retrain periodically so the model keeps improving over time.
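If it helps with the error analysis above, here is a minimal sketch using the Document AI Python client. It assumes a deployed custom classifier processor; the project ID, location, processor ID, and file path are placeholders. It sends one document through the processor and prints the predicted labels with their confidence scores, so you can spot low-confidence predictions or classes that are systematically confused:

```python
# Minimal sketch: inspect classifier predictions for a single document.
# PROJECT_ID, LOCATION, PROCESSOR_ID, and the file path are placeholders.
from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai

PROJECT_ID = "your-project-id"
LOCATION = "us"  # or "eu", wherever the processor lives
PROCESSOR_ID = "your-classifier-processor-id"

# Use the regional endpoint that matches the processor's location.
client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(
        api_endpoint=f"{LOCATION}-documentai.googleapis.com"
    )
)
name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

with open("sample.pdf", "rb") as f:
    raw_document = documentai.RawDocument(
        content=f.read(), mime_type="application/pdf"
    )

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)

# For a custom classifier, predicted classes surface as entities:
# entity.type_ is the label and entity.confidence is its score.
for entity in result.document.entities:
    print(f"{entity.type_}: {entity.confidence:.2f}")
```

Running something like this over a handful of misclassified test documents often shows whether the errors cluster around a particular label or layout, which tells you where to add or relabel training data.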

For more information about the Custom Document Classifier, you can read this documentation.

 Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Thank you for your response. I truly appreciate it. I have a few additional questions:

  1. How should I prepare the training set and split the data between the training and test sets, given that we have many variations for each label?
  2. Should we aim for an 80/20 split for each variation? (I've sketched what I mean below.)
  3. Could you provide a rough estimate of the number of documents required for each variation?
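To make question 2 concrete, here is roughly the kind of split I have in mind, a rough sketch using scikit-learn's train_test_split with stratification. The file names and the label/variation values are just placeholders for however our documents are organized:

```python
# Rough sketch: 80/20 split that preserves the ratio within each
# label/variation combination. The document list below is synthetic.
from sklearn.model_selection import train_test_split

docs = (
    [(f"invoices/layout_a_{i:03d}.pdf", "invoice", "layout_a") for i in range(20)]
    + [(f"invoices/layout_b_{i:03d}.pdf", "invoice", "layout_b") for i in range(20)]
    + [(f"receipts/layout_a_{i:03d}.pdf", "receipt", "layout_a") for i in range(20)]
)

paths = [path for path, _, _ in docs]
# Stratify on label + variation so every variation keeps an 80/20 ratio.
strata = [f"{label}/{variation}" for _, label, variation in docs]

train_paths, test_paths = train_test_split(
    paths, test_size=0.2, stratify=strata, random_state=42
)
print(len(train_paths), "train /", len(test_paths), "test")
```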

Your guidance on these points would be greatly appreciated. Thanks again.