How to control the segmentation and extraction of structured data using Document AI API OCR

Dhruv016 · 07-10-2023 11:05 AM

I am using the Document AI OCR API. I am trying to extract the text of a document in a formatted manner so that I can run regex over the output to get the result.

For example, if a document has fields like Registration Number: 12345 and Name: XYZ on separate lines, I want the output on two lines as well.

But in the text the API returns, "Registration Number" is on the 1st line, "Name:" on the 2nd, then 12345 on the 3rd and XYZ on the 4th. Even getting 12345 on the 2nd line would work for me. How can I fix this segmentation on v1 of Document AI?

Please help me figure out how to fix the segmentation of the output.

@ErnestoC @kvandres

Dhruv016

@kvandres Please help me out with this problem. I tried the OCR of Form Parser and got a much better output for "document.text". Is there any way to get the same kind of output from Document OCR, or to instruct it to return the output in a better layout? Form Parser doesn't understand many of the languages I have to work with.
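
One possible workaround for the line-order problem described above, independent of which processor is used, is to ignore the order of "document.text" and rebuild rows from the token bounding boxes. The following is only a minimal sketch: it assumes the response is in its JSON form (tokens under `pages[].tokens[]`, each with `layout.textAnchor.textSegments` and `layout.boundingPoly.normalizedVertices`), and the helper names `anchor_text`, `box` and `rows_from_page` as well as the `tolerance` heuristic are hypothetical, not part of the API.

```python
from typing import Any


def anchor_text(full_text: str, layout: dict[str, Any]) -> str:
    """Return the slice of document.text that a layout's textAnchor points at."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    # In JSON, int64 indices arrive as strings and startIndex is omitted when 0.
    return "".join(
        full_text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
        for seg in segments
    )


def box(layout: dict[str, Any]) -> tuple[float, float, float]:
    """Return (left, top, bottom) of a layout from its normalized vertices."""
    verts = layout["boundingPoly"]["normalizedVertices"]
    xs = [v.get("x", 0.0) for v in verts]
    ys = [v.get("y", 0.0) for v in verts]
    return min(xs), min(ys), max(ys)


def rows_from_page(full_text: str, page: dict[str, Any],
                   tolerance: float = 0.5) -> list[str]:
    """Group a page's tokens into visual rows, each read left to right.

    A token joins the current row when its vertical center is within
    `tolerance` token heights of the row's first token (an assumed heuristic
    that only suits roughly horizontal, unrotated text).
    """
    items = []
    for token in page.get("tokens", []):
        layout = token["layout"]
        text = anchor_text(full_text, layout).strip()
        if text:
            left, top, bottom = box(layout)
            items.append((left, (top + bottom) / 2, bottom - top, text))

    items.sort(key=lambda it: it[1])  # top to bottom by vertical center
    rows: list[list[tuple]] = []
    for item in items:
        if rows:
            first = rows[-1][0]
            if abs(item[1] - first[1]) <= tolerance * max(item[2], first[2]):
                rows[-1].append(item)
                continue
        rows.append([item])

    # Within each row, order tokens by their left edge.
    return [" ".join(it[3] for it in sorted(row, key=lambda it: it[0]))
            for row in rows]
```

Calling `rows_from_page(document["text"], page)` for each page gives one string per visual row, so a label and a value that sit on the same line of the scan end up in the same string, which is easier to match with a regex. Skewed or multi-column documents would likely need a more careful grouping rule.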

kvandres

Good day @Dhruv016,

Welcome to Google Cloud Community!

You can try creating a Custom Document Extractor. It lets you extract the specific information you need from your documents, since it is trained to understand their structure. Please note that you will need to train, evaluate and deploy the model yourself if you go this route: https://cloud.google.com/document-ai/docs/workbench/build-custom-processor It may also help in your case because it supports more languages, possibly including the ones you are looking for: https://cloud.google.com/document-ai/docs/languages#custom_processors
If your preferred language is still not supported, I would suggest that you file a feature request using this link: https://cloud.google.com/support/docs/issue-trackers
If you are calling Document AI OCR through the API, you can also try setting the OCR configuration fields that suit your case (e.g. the advanced options field lets you choose the most suitable layout algorithm): https://cloud.google.com/document-ai/docs/reference/rest/v1beta3/ProcessOptions#ocrconfig
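
As a rough illustration of the OCR configuration option above, the sketch below builds the JSON body for a REST `:process` call with `processOptions.ocrConfig` set. It assumes the `advancedOcrOptions` value `legacy_layout` (the alternative heuristic layout algorithm) and the `hints.languageHints` field as described in the linked reference; please check there which fields your API version accepts. The function name `build_process_request` and the sample language codes are hypothetical, and authentication and the HTTP call itself are left out.

```python
import base64
import json


def build_process_request(file_bytes: bytes, mime_type: str,
                          language_hints=None,
                          legacy_layout: bool = True) -> dict:
    """Build a JSON-serializable body for a Document AI OCR process request."""
    ocr_config = {}
    if legacy_layout:
        # Ask for the heuristic layout detection instead of the default one.
        ocr_config["advancedOcrOptions"] = ["legacy_layout"]
    if language_hints:
        # Optional BCP-47 language codes to guide the OCR.
        ocr_config["hints"] = {"languageHints": list(language_hints)}
    return {
        "rawDocument": {
            # The REST API expects the file content base64-encoded.
            "content": base64.b64encode(file_bytes).decode("ascii"),
            "mimeType": mime_type,
        },
        "processOptions": {"ocrConfig": ocr_config},
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice, read the real file from disk.
    body = build_process_request(b"%PDF-1.4 placeholder",
                                 "application/pdf", ["hi", "en"])
    print(json.dumps(body, indent=2))
```

The resulting body would be sent as a POST to the processor's `:process` endpoint. Comparing "document.text" with and without `legacy_layout` on a few sample documents is probably the quickest way to see which layout algorithm keeps labels and values together.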

Hope this helps!
