Solved: Document AI import dataset in a different processor

oleks_vasyliev · 07-10-2024 04:29 PM

Hello.

As far as I can see, the Document AI processor interface has an "Export dataset" function, but I cannot find a way to import the exported dataset into a different processor. I found some information in this doc: https://cloud.google.com/document-ai/docs/label-documents?authuser=1#import-labels, but the link "Sending a processing request to an existing processor" is broken and returns a 404.

Context: I need to copy the schema and the already labeled documents from an invoice processor into a custom processor.

Thanks

dawnberdan

Hi @oleks_vasyliev,

Welcome to Google Cloud Community!

The link to "Sending a processing request to an existing processor" returns a 404 on my end as well, but I found the correct link, which you can access here. Although Document AI processors offer an "Export dataset" function, there isn't a straightforward way to import the entire dataset, including the schema and labeled documents, into a different processor.

However, you can consider a couple of workarounds:

1. Auto-labeling with a Foundation Model Processor:

Import your documents into the new processor and enable "Import with auto-labeling".
This uses a pre-trained foundation model processor to automatically assign labels to your documents based on the existing schema.
Manually review and correct these auto-labeled documents before using them to train the new processor.
For more information, refer to the documentation: Custom extractor mechanisms

2. Recreate the Schema and Manually Import Labeled Documents:

Define a new schema in the target processor that closely matches the schema of the original processor. Refer to Document AI Schemas for instructions on defining schemas, including data types and field types.
Next, import the documents from the original dataset into the new processor. The Label documents guide explains how to use the labeling tool to manually label the imported documents in your new processor.

I hope the above information is helpful.

Cope99

I encountered the same issue and found a way to import labeled datasets into a custom Document AI processor. Here's the documentation: https://cloud.google.com/document-ai/docs/create-dataset#import

Here is the important section:

When you select Import, Document AI imports all of the supported file types as well as JSON Document files into the dataset. For JSON Document files, Document AI imports the document and converts its entities into label instances.
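
Since the entities in each JSON Document file become label instances, it can help to check which entity types the exported files contain before importing. Below is a minimal sketch, assuming each exported file is a Document JSON with a top-level `entities` list whose items carry a `type` and optionally nested `properties`. The sample document, the field names, and the helper names are hypothetical and only for illustration.

```python
import json


def collect_entity_types(document: dict) -> set[str]:
    """Return every entity type in a Document JSON dict, including nested properties."""
    found: set[str] = set()
    stack = list(document.get("entities", []))
    while stack:
        entity = stack.pop()
        entity_type = entity.get("type")
        if entity_type:
            found.add(entity_type)
        # Nested (child) entities are stored under "properties".
        stack.extend(entity.get("properties", []))
    return found


def unknown_entity_types(document: dict, schema_fields: set[str]) -> set[str]:
    """Return entity types in the document that the target schema does not define."""
    return collect_entity_types(document) - schema_fields


# Hypothetical exported document; real exports contain many more fields.
sample = json.loads("""
{
  "entities": [
    {"type": "invoice_id", "mentionText": "INV-1"},
    {"type": "line_item",
     "properties": [{"type": "line_item/description", "mentionText": "Widget"}]},
    {"type": "supplier_name", "mentionText": "ACME"}
  ]
}
""")

# Hypothetical field names defined in the target processor's schema.
target_schema_fields = {"invoice_id", "line_item", "line_item/description"}

print(sorted(unknown_entity_types(sample, target_schema_fields)))
```

Any type this reports is a candidate for a schema mismatch and is worth fixing in the target schema before running the import.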

In summary, if you select the folder in your bucket where the dataset was exported, the import applies the same labels to the documents. The field names must be the same in the original and the new dataset. I had to run two separate imports, one for the test split and one for the training split.
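
For anyone who prefers the API over the console, the two imports could be scripted roughly as follows. This is only a sketch: the request field names reflect my understanding of the v1beta3 dataset `importDocuments` method and should be checked against the current API reference. The bucket paths are hypothetical, and the helper only builds the request bodies; it does not send anything.

```python
# Split names as I understand them from the v1beta3 API (an assumption to verify).
VALID_SPLITS = {"DATASET_SPLIT_TRAIN", "DATASET_SPLIT_TEST"}


def build_import_request(gcs_prefix: str, split: str) -> dict:
    """Build a JSON body that imports one exported folder into one dataset split."""
    if split not in VALID_SPLITS:
        raise ValueError(f"unsupported split: {split}")
    if not gcs_prefix.startswith("gs://"):
        raise ValueError("gcs_prefix must be a gs:// URI")
    return {
        "batchDocumentsImportConfigs": [
            {
                "datasetSplit": split,
                "batchInputConfig": {
                    "gcsPrefix": {"gcsUriPrefix": gcs_prefix},
                },
            }
        ]
    }


# Hypothetical export locations: one folder per split, matching the two imports above.
train_body = build_import_request("gs://my-bucket/export/train/", "DATASET_SPLIT_TRAIN")
test_body = build_import_request("gs://my-bucket/export/test/", "DATASET_SPLIT_TEST")

print(train_body)
print(test_body)
```

Keeping one request per split mirrors what I did in the console, where training and test documents were imported separately.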
