Re: How to insert JSON along with PDF into Documen...

malimasood · 04-19-2023 07:29 AM

Hi,

Our use case is to process PDF documents with a Document AI processor and pass the resulting JSON along with the PDF to Document AI Warehouse. I am using the contentwarehouse.CreateDocumentRequest function. It works well if I supply only the PDF document, but if I process the file with Document AI and push the JSON along with the PDF, it gives the following error:

File "<ipython-input-63-d034c5106990>", line 1, in <module>
runfile('C:/Users/HP/Downloads/test simple.py', wdir='C:/Users/HP/Downloads')

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/HP/Downloads/test simple.py", line 110, in <module>
process_document_sample(config['project_id'],config['location'],config['Custom_processor_id'],file_path,mime_type)

File "C:/Users/HP/Downloads/test simple.py", line 88, in process_document_sample
doc=documentai.types.Document(docDictionary)

File "D:\Anaconda3\lib\site-packages\proto\message.py", line 566, in __init__
"Unknown field for {}: {}".format(self.__class__.__name__, key)

ValueError: Unknown field for Document: _pb

The following is the code snippet:

def process_document_sample(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
    field_mask: str = None,
):
    # You must set the api_endpoint if you use a location other than 'us'.
    opts = storage.Client(project_id)

    client = documentai.DocumentProcessorServiceClient()

    # The full resource name of the processor, e.g.:
    # projects/{project_id}/locations/{location}/processors/{processor_id}
    name = client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name, raw_document=raw_document, field_mask=field_mask
    )

    result = client.process_document(request=request)

    # return result
    # TODO(developer): Uncomment these variables before running the sample.
    # project_number = 'YOUR_PROJECT_NUMBER'
    # location = 'YOUR_PROJECT_LOCATION' # Format is 'us' or 'eu'
    #print(result.document.entities)

    # Create a Schema Service client
    import json
    with open('test.json','w') as f:
        json.dump(documentai.Document.to_dict(result.document),f)
    # documentai.Document.to_dict(result)
    document_schema_client = contentwarehouse.DocumentSchemaServiceClient()

    # The full resource name of the location, e.g.:
    # projects/{project_number}/locations/{location}
    parent = document_schema_client.common_location_path(
        project=config['project_number'], location=config['location']
    )

    # Create a Document Service client
    document_client = contentwarehouse.DocumentServiceClient()

    # The full resource name of the location, e.g.:
    # projects/{project_number}/locations/{location}
    parent = document_client.common_location_path(
        project=config['project_number'], location=config['location']
    )
    #print(result.document._pb)
    docDictionary = result.document.__dict__
    doc=documentai.types.Document(docDictionary)
    # Define Document
    document = contentwarehouse.Document(
        # raw_document_file_type=1,
        display_name="60.pdf",
        document_schema_name=schema_URI,
        inline_raw_document=open('60.pdf','rb').read(),
        #plain_text=str(result.document)
        cloud_ai_document=doc
    )

    # Define Request
    create_document_request = contentwarehouse.CreateDocumentRequest(
        parent=parent, document=document
    )

    # Create a Document for the given schema
    response = document_client.create_document(request=create_document_request)
    # print(response)

process_document_sample(config['project_id'],config['location'],config['Custom_processor_id'],file_path,mime_type)

I have read all the documentation, but I couldn't find out why the Document object does not accept the dictionary.

Aris_O

Hi @malimasood,

Welcome to Google Cloud Community.

It looks like the following line is the source of the error:

doc = documentai.types.Document(docDictionary)

The docDictionary variable holds the Document object returned by the Document AI API as a dictionary, but it appears that the Document class constructor does not accept that dictionary as a parameter.

Instead, you should build a new Document object from the dictionary representation using the from_dict class method of the Document class. Here's how to change your code so that it uses from_dict:

doc = documentai.types.Document.from_dict(docDictionary)

After this change, the doc variable should hold a valid Document object that you can use when building the CreateDocumentRequest.

Here is some documentation that might help you:
https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
https://cloud.google.com/document-ai/docs/handle-response
https://cloud.google.com/discovery-engine/media/docs/documents

malimasood

Thanks for the prompt response, but I tried your solution and it still throws an exception:

Traceback (most recent call last):

File "<ipython-input-1-d034c5106990>", line 1, in <module>
runfile('C:/Users/HP/Downloads/test simple.py', wdir='C:/Users/HP/Downloads')

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/HP/Downloads/test simple.py", line 110, in <module>
process_document_sample(config['project_id'],config['location'],config['Custom_processor_id'],file_path,mime_type)

File "C:/Users/HP/Downloads/test simple.py", line 88, in process_document_sample
doc=documentai.types.Document.from_dict(docDictionary)

AttributeError: type object 'Document' has no attribute 'from_dict'

Code Changes:
docDictionary = result.document.__dict__
doc=documentai.types.Document.from_dict(docDictionary)
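
For anyone hitting the same two errors: the message classes in the Document AI client library are proto-plus wrappers. As far as I know they provide to_dict, to_json and from_json class methods but no from_dict, which would explain the AttributeError. The _pb key comes from __dict__, which exposes the wrapper's internal protobuf attribute rather than the document's fields. Below is a minimal sketch of that mechanism in plain Python. FakeDocument and its two fields are hypothetical stand-ins made up for illustration, not the real documentai.Document.

```python
from typing import Any, Dict, Optional


class FakeDocument:
    """Hypothetical, simplified stand-in for a proto-plus style message."""

    _FIELDS = ("text", "mime_type")  # illustrative field names only

    def __init__(self, mapping: Optional[Dict[str, Any]] = None, **kwargs: Any) -> None:
        values = dict(mapping or {}, **kwargs)
        for key in values:
            if key not in self._FIELDS:
                # Same message shape as the ValueError in the traceback above.
                raise ValueError(
                    "Unknown field for {}: {}".format(type(self).__name__, key)
                )
        # All field data lives behind one private attribute, here named `_pb`.
        self.__dict__["_pb"] = {name: values.get(name, "") for name in self._FIELDS}

    def __getattr__(self, name: str) -> Any:
        # Only called when normal lookup fails; forwards to the stored fields.
        try:
            return self.__dict__["_pb"][name]
        except KeyError:
            raise AttributeError(name) from None

    @classmethod
    def to_dict(cls, instance: "FakeDocument") -> Dict[str, Any]:
        # Field-level view of the message, keyed by field name.
        return dict(instance.__dict__["_pb"])


original = FakeDocument(text="Invoice 60", mime_type="application/pdf")

# What the failing code did: __dict__ exposes the wrapper's internals.
internals = vars(original)
try:
    FakeDocument(internals)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Field-level round trip: to_dict() yields real field names the constructor accepts.
fields = FakeDocument.to_dict(original)
rebuilt = FakeDocument(fields)
```

With the real library, the equivalent lines would presumably be docDictionary = documentai.Document.to_dict(result.document) followed by doc = documentai.Document(docDictionary). Since result.document is already a Document message, passing cloud_ai_document=result.document directly may also work and would skip the round trip. Both are assumptions that have not been run against Document AI Warehouse.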

How to insert JSON along with PDF into Document AI Warehouse using API