Hello,
I am building a Custom Extractor in Document AI to extract data from a form that has changed over time and, therefore, has a number of different, but similar, layouts all containing the same information.
I've met the "full requirements" (i.e. 50 test and training instances of each label) and, for the most part, I am happy with my percentages.
However, I am stuck with a section that may or may not be tabular.
In some cases the form has a table that can contain zero or more records/rows:
In the case of tabular data, DocAI evaluates and auto-labels everything with a high degree of accuracy:
In other cases, though, the form has separate fields to capture the details for only a single record:
However, DocAI seems incapable of evaluating and auto-labeling this as a single record and, instead, evaluates it as multiple, incomplete records:
In the above example, note that it groups section “7. Seams” as its own parent label, separate from section “6. Shell”, when in fact sections 6 & 7 together represent a single record.
Because it’s possible, in the tabular case, for a form to have multiple records, I am using a Parent Label to capture the details. I was hoping that DocAI would be smart enough to figure out, given enough examples, that in some cases there is only a single record that is not in a tabular format.
I have uploaded and labeled a large number of documents (500+) in both formats, yet DocAI is still breaking the non-tabular format into multiple records rather than treating it as a single record.
So the question is: are Parent Labels capable of this? That is, if I just continue to add more examples, will it eventually figure it out, or should this actually be two different processors, one for tabular, multi-record form layouts and another for non-tabular, single-record form layouts?
I would really prefer to have a single processor, if possible.
This question is related to another post ( Document AI don't recognize parent label area correctly, and does it only on per line basis ), for which there was no answer.
Thank you!
Hi @StephenElmer1,
Welcome to Google Cloud Community!
Even if you add more examples to train your Custom Extractor Processor, using a single processor is not a recommended approach, as it will not accurately and reliably recognize parent label areas that span multiple lines with inconsistent layouts. The workaround is to use two separate processors.
Beyond that, I suggest filing a feature request asking for Document AI to recognize a parent label area that spans multiple lines. Feature requests are publicly visible, so you will be able to track the progress of yours. Please note that I can't provide any details or timelines at this moment. For future updates, please keep an eye on the release notes for new features related to Document AI. Before filing, please check this documentation on what to expect after you've opened an issue.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Thanks @marckevin for responding to this. Even though I thought I had subscribed to the thread, I never got the notice that there was a reply.
I'm hoping to avoid having to use multiple processors. Partly because I've already uploaded and labeled a large number of documents. But also because I suppose I would then have to create a document classifier and then dig around for some workflow feature or bake it into my custom tooling to stitch these all together.
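For reference, the stitching described above could be fairly small. The following is only a rough sketch under assumptions: the processor resource names, the layout labels `tabular` and `single_record`, and the `classify` and `extract` callables are all hypothetical placeholders. In practice those two callables would presumably wrap calls to a classifier processor and an extractor processor through the Document AI client library.

```python
from typing import Callable, Dict

# Hypothetical processor resource names; replace with real ones.
PROCESSORS: Dict[str, str] = {
    "tabular": "projects/PROJECT/locations/LOCATION/processors/TABULAR_ID",
    "single_record": "projects/PROJECT/locations/LOCATION/processors/SINGLE_ID",
}


def route_document(
    content: bytes,
    classify: Callable[[bytes], str],
    extract: Callable[[str, bytes], dict],
) -> dict:
    """Classify a document's layout, then send it to the matching extractor.

    `classify` returns a layout label; `extract` takes a processor name and
    the raw document bytes and returns the extraction result as a dict.
    Both are injected so this sketch does not assume any particular client API.
    """
    layout = classify(content)
    if layout not in PROCESSORS:
        raise ValueError(f"Unknown layout label: {layout!r}")
    result = extract(PROCESSORS[layout], content)
    result["layout"] = layout  # keep track of which path produced the result
    return result
```

Passing the two calls in as functions keeps the routing logic testable without a live project, and the classifier could later be swapped for a simpler rule if one turns out to be enough to tell the layouts apart.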
Poking around in various forums and documentation here and there, I get the impression that the Parent Label feature is supposed to have some capacity to pull data bits that aren't necessarily tabular. However, it isn't clear that it is able to do so. Anyway, I will submit a feature request. If you happen to have any other ideas, I'm all ears.
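One other idea that would keep a single processor is to repair the split records in post-processing. This is a minimal sketch, not a tested solution: it assumes each parent entity has already been flattened into a plain dict of child label to text, and the child label names (`shell_material`, `seam_type`, and so on) are made up for illustration. The heuristic is that table rows repeat the same child labels, while fragments of one non-tabular record carry different ones, so consecutive parent entities with no child labels in common are folded together.

```python
from typing import Dict, List

Record = Dict[str, str]  # child label -> extracted text


def merge_fragments(records: List[Record]) -> List[Record]:
    """Fold consecutive parent entities together while their child labels do not collide."""
    merged: List[Record] = []
    for record in records:
        # No shared child labels with the previous record: treat as a fragment of it.
        if merged and not (merged[-1].keys() & record.keys()):
            merged[-1] = {**merged[-1], **record}
        else:
            # Shared labels (or first record): treat as a separate row.
            merged.append(dict(record))
    return merged


# Non-tabular layout: "6. Shell" and "7. Seams" came back as two incomplete records.
fragments = [
    {"shell_material": "steel", "shell_thickness": "0.5"},
    {"seam_type": "welded"},
]
print(merge_fragments(fragments))

# Tabular layout: each row repeats the same labels, so rows stay separate.
rows = [
    {"shell_material": "steel", "seam_type": "welded"},
    {"shell_material": "alloy", "seam_type": "riveted"},
]
print(merge_fragments(rows))
```

The obvious weakness is a table whose rows have missing cells: two sparse rows with no labels in common would be merged by mistake, so this would need a check against the actual documents before relying on it.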