Hi,
I’m seeking advice on how to train a custom model in Document AI to achieve high accuracy for our use case. Here are the details of what we are working on:
1. Objective: We aim to use a custom splitter to divide financial reports into individual sub-reports based on a predefined set of labels. Our goal is to ensure that the model can correctly classify and split these documents.
2. Scope: We deal with a wide variety of document types, making it impractical to label all of them. Instead, we plan to start with a small set of labels. The expectation is that the model will produce a high confidence score for documents that match one of the trained labels and a low score for documents outside that set.
3. Data Preparation: For each label, we have approximately three document variants. What is the best practice for preparing the training data in this case? Should we split the data 80% training / 20% testing for each variant? Roughly how many documents are required per variant? And are there specific considerations to keep in mind given the limited number of variants per label?
4. Confidence Score Usage: In the response, a confidence score appears in two places:
- At the top-level entity
- Within each page reference
Which confidence score should we rely on for evaluation and decision-making? Are there scenarios where one is more reliable than the other?
{
  "confidence": 0.48404133,
  "page_anchor": {
    "page_refs": [
      {
        "confidence": 0.99682796
      }
    ]
  }
}
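For concreteness, this is how we currently pull both scores out of an entity (a minimal sketch that parses the JSON sample above, not the full API response object):

```python
import json

entity_json = """
{
  "confidence": 0.48404133,
  "page_anchor": {
    "page_refs": [
      {
        "confidence": 0.99682796
      }
    ]
  }
}
"""

entity = json.loads(entity_json)

# Top-level score: the model's confidence in the entity/label as a whole.
entity_conf = entity["confidence"]

# Per-page scores: the model's confidence that each referenced page
# belongs to this entity.
page_confs = [ref["confidence"] for ref in entity["page_anchor"]["page_refs"]]

print(entity_conf, page_confs)
```

As the sample shows, the two can disagree substantially (0.48 vs. 0.997 here), which is exactly why we are unsure which one to threshold on.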
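To make point 3 concrete, the per-variant 80/20 split we have in mind would look like this (a minimal sketch; the filenames, label name, and document count are purely illustrative):

```python
import random

def split_train_test(doc_ids, train_frac=0.8, seed=42):
    """Shuffle the documents of one label variant and split them 80/20."""
    rng = random.Random(seed)
    docs = list(doc_ids)
    rng.shuffle(docs)
    cut = max(1, int(len(docs) * train_frac))  # keep at least one training doc
    return docs[:cut], docs[cut:]

# Illustrative: 10 documents of one variant of a hypothetical "balance_sheet" label
docs = [f"balance_sheet_v1_{i}.pdf" for i in range(10)]
train, test = split_train_test(docs)
print(len(train), len(test))
```

Is a per-variant split like this the right approach, or should the 80/20 split be done per label across all variants?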
Any guidance or recommendations you can provide would be greatly appreciated. We want to ensure that our training process is aligned with best practices and that we achieve reliable and consistent results.
Thank you.