Solved: Issue with Custom Extractor: Unexpected Fields in ...

andressasoares · 01-22-2025 07:03 AM

Hi everyone,

I'm facing an issue with my processor. It's a custom extractor, but the JSON output contains fields typically associated with a specialized extractor, such as normalizedValue, as well as fields that are not defined in my schema (e.g., supplier_phone, supplier_name, etc.).

However, when I upload the same file to the Document AI UI, it only extracts the labels defined in my schema, as expected. This problem only occurs when using the API request.

I'm currently using batch processing. Is there any specific configuration I need to adjust to resolve this issue?

Thanks in advance for any help!

dawnberdan

Hi @andressasoares,

Welcome to Google Cloud Community!

The issue you're facing comes from a difference in how Document AI applies your schema in the UI and in the API's batch mode. The UI usually applies the schema strictly, while a batch request through the API can return additional fields unless you explicitly restrict the output.

To fix this, you need to control the output fields in the API request settings. The main thing is to use the fieldMask parameter in documentOutputConfig (for batch requests it sits under gcsOutputConfig), which lets you specify exactly which fields should be included in the response.

Here's how to solve the problem:

1. Identify Your Desired Fields: Make a list of only the fields you defined in your custom schema.
2. Construct the fieldMask: Create a comma-separated string listing the fully qualified names of the fields you identified. For example, if your schema has fields named invoiceNumber and invoiceDate, your fieldMask might look like this: "invoiceNumber,invoiceDate". If you have nested fields, use dot notation (e.g., "customer.name,customer.address").
3. Include fieldMask in Your API Request: Modify your batch processing API request to include the fieldMask within the documentOutputConfig. The documentOutputConfig likely already exists; you just need to add or modify the fieldMask element within it.
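
As a rough illustration of steps 2 and 3, here is a minimal Python sketch that joins a list of field paths into a comma-separated fieldMask and places it in a batchProcess request body under documentOutputConfig.gcsOutputConfig. The helper name build_batch_request_body and the bucket URIs are hypothetical, and the field names are just the examples from the steps above. The API reference describes the mask in terms of fields of the output Document (for example text or entities), so it is worth confirming there which paths the mask actually accepts before relying on schema label names.

```python
import json


def build_batch_request_body(input_prefix, output_uri, field_paths):
    """Build a JSON-serializable body for a Document AI batchProcess REST call.

    Hypothetical helper: it only assembles the dictionary, it does not send it.
    """
    # In the REST/JSON encoding, a field mask is one comma-separated string.
    field_mask = ",".join(p.strip() for p in field_paths if p.strip())
    return {
        "inputDocuments": {"gcsPrefix": {"gcsUriPrefix": input_prefix}},
        "documentOutputConfig": {
            "gcsOutputConfig": {
                "gcsUri": output_uri,
                "fieldMask": field_mask,
            }
        },
    }


body = build_batch_request_body(
    "gs://example-input-bucket/invoices/",  # hypothetical input location
    "gs://example-output-bucket/results/",  # hypothetical output location
    ["invoiceNumber", "invoiceDate", "customer.name"],  # example paths from the steps
)
print(json.dumps(body, indent=2))
```

Keeping the mask as a Python list until the last moment makes it easy to review or generate from your schema, and the join step removes stray spaces that would otherwise end up inside the mask string.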

By setting the fieldMask explicitly, you ensure that the API returns only the fields you've selected in your schema, just like the Document AI UI, and no extra, unwanted fields. If you're still having issues, make sure the schema used in your API call exactly matches the one in your Document AI project.
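
To check whether the batch output really matches your schema, one option is to compare the entity types in an output JSON file against your list of labels. The sketch below assumes the output follows the Document JSON layout, with an entities list whose items carry a type and optional nested properties. The function name unexpected_entity_types, the sample document, and the label set are all hypothetical.

```python
def unexpected_entity_types(document, schema_labels):
    """Return entity types found in the output that the schema does not define."""
    found = set()

    def walk(entities):
        # Entities may nest child entities under "properties".
        for entity in entities or []:
            entity_type = entity.get("type")
            if entity_type:
                found.add(entity_type)
            walk(entity.get("properties"))

    walk(document.get("entities"))
    return found - set(schema_labels)


# Hypothetical output document, shaped like the case described in the question.
sample_document = {
    "entities": [
        {"type": "invoiceNumber", "mentionText": "INV-001"},
        {"type": "supplier_name", "mentionText": "Acme Ltd"},
        {
            "type": "invoiceDate",
            "mentionText": "2025-01-22",
            "properties": [{"type": "supplier_phone", "mentionText": "555-0100"}],
        },
    ]
}

extra = unexpected_entity_types(sample_document, {"invoiceNumber", "invoiceDate"})
print(sorted(extra))  # ['supplier_name', 'supplier_phone']
```

If the returned set is not empty, the processor version or schema behind the API call probably differs from the one you tested in the UI.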

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
