Hi @hzham,
Welcome to Google Cloud Community!
You're facing a common issue with the new Document AI OCR versions (v1.2 and v1.3): your fine-tuned extractors are performing worse than non-fine-tuned ones, and the older models (v1.1) are outperforming the new ones. Here are the troubleshooting steps you can take:
Additionally, you can refer to this document to learn how to fine-tune models for more accurate data extraction from your documents.
I hope the above information is helpful.
Hi,
First of all, thank you for the reply. For your suggestions:
Relabel: I have already done that; this is the result from the relabeled dataset with the latest OCR.
Data Review: I have used the same dataset with almost the same labelling (of course, there might be slight differences in the label borders, but I do not reckon that would produce such a major impact).
Increase Data: If the new models do not require more training data to reach the same level as the old model, this cannot be the cause of the problem. If they do, then yes, this might work.
Fine-tuning parameters: I have done some experiments, and increasing the training steps worked like a charm; its F1 score is now much higher than the old one's. Is there any documentation or reference on how to find the best parameter combinations in Document AI? A rough sketch of how these parameters can be set through the API follows below.
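For reference, the training steps and learning rate multiplier can also be set when a tuning run is started through the API rather than the console. Below is a minimal, untested sketch using the Python client; the project, processor, and bucket values are placeholders, and the foundation_model_tuning_options field (with train_steps and learning_rate_multiplier) reflects my reading of the v1beta3 surface, so verify it against the client library version you have installed.

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1beta3 as documentai

# Placeholder identifiers and paths; replace with your own values.
PROJECT_ID = "my-project"
LOCATION = "us"
PROCESSOR_ID = "my-processor-id"

client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
)

request = documentai.TrainProcessorVersionRequest(
    parent=client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID),
    processor_version=documentai.ProcessorVersion(
        display_name="extractor-finetuned-400-steps"
    ),
    input_data=documentai.TrainProcessorVersionRequest.InputData(
        training_documents=documentai.BatchDocumentsInputConfig(
            gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix="gs://my-bucket/train/")
        ),
        test_documents=documentai.BatchDocumentsInputConfig(
            gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix="gs://my-bucket/test/")
        ),
    ),
    # Tuning hyperparameters: these field names follow the v1beta3 REST surface
    # (foundationModelTuningOptions.trainSteps / learningRateMultiplier) as I
    # understand it, so double-check them against your installed library.
    foundation_model_tuning_options=documentai.TrainProcessorVersionRequest.FoundationModelTuningOptions(
        train_steps=400,
        learning_rate_multiplier=1.0,
    ),
)

# Training runs as a long-running operation; the new version (and its
# evaluation on the test set) appears under the processor once it finishes.
operation = client.train_processor_version(request=request)
print("Started training operation:", operation.operation.name)
```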
Best,
Hamit
Just for your information, in my case the fine-tuned models are better than the pre-trained models for both v1.2 and v1.3: the scores improved from 0.75 and 0.78 to 0.83 after fine-tuning. I have more than 200 labeled documents.
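In case it helps with comparing versions, the evaluation metrics for every version of a processor can also be pulled through the API instead of opening each version's evaluation page in the console. This is a rough, untested sketch; all identifiers are placeholders, and the metrics layout (all_entities_metrics with confidence_level_metrics) reflects my reading of the v1beta3 types, so it may need adjusting for your client version.

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1beta3 as documentai

# Placeholder identifiers; replace with your own project, location, and processor.
PROJECT_ID = "my-project"
LOCATION = "us"
PROCESSOR_ID = "my-processor-id"

client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
)
parent = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

# Walk every version of the processor (pretrained v1.2/v1.3 plus any fine-tuned
# ones) and print its evaluation metrics, so the before/after comparison can be
# read in one place.
for version in client.list_processor_versions(parent=parent):
    print(version.display_name)
    for evaluation in client.list_evaluations(parent=version.name):
        # all_entities_metrics holds metrics per confidence threshold in
        # v1beta3 as far as I can tell; adjust if your client version differs.
        for level in evaluation.all_entities_metrics.confidence_level_metrics:
            m = level.metrics
            print(f"  conf >= {level.confidence_level:.2f}  "
                  f"precision={m.precision:.3f}  recall={m.recall:.3f}  "
                  f"f1={m.f1_score:.3f}")
```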