
How to insert JSON along with PDF into Document AI Warehouse using API

Hi, 

Our use case is to process PDF documents with a Document AI processor and pass the resulting JSON along with the PDF to Document AI Warehouse. I am using contentwarehouse.CreateDocumentRequest. It works well if I supply only the PDF document, but if I process the file with Document AI and push the JSON along with the PDF, it raises the following error:


File "<ipython-input-63-d034c5106990>", line 1, in <module>
runfile('C:/Users/HP/Downloads/test simple.py', wdir='C:/Users/HP/Downloads')

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/HP/Downloads/test simple.py", line 110, in <module>
process_document_sample(config['project_id'],config['location'],config['Custom_processor_id'],file_path,mime_type)

File "C:/Users/HP/Downloads/test simple.py", line 88, in process_document_sample
doc=documentai.types.Document(docDictionary)

File "D:\Anaconda3\lib\site-packages\proto\message.py", line 566, in __init__
"Unknown field for {}: {}".format(self.__class__.__name__, key)

ValueError: Unknown field for Document: _pb

 


Following is the code snippet:

def process_document_sample(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
    field_mask: str = None,
):
    # You must set the api_endpoint if you use a location other than 'us'.
    opts = storage.Client(project_id)

    client = documentai.DocumentProcessorServiceClient()

    # The full resource name of the processor, e.g.:
    # projects/{project_id}/locations/{location}/processors/{processor_id}
    name = client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name, raw_document=raw_document, field_mask=field_mask
    )

    result = client.process_document(request=request)

    # return result
    # TODO(developer): Uncomment these variables before running the sample.
    # project_number = 'YOUR_PROJECT_NUMBER'
    # location = 'YOUR_PROJECT_LOCATION' # Format is 'us' or 'eu'
    #print(result.document.entities)

    # Create a Schema Service client
    import json
    with open('test.json','w') as f:
        json.dump(documentai.Document.to_dict(result.document),f)
    # documentai.Document.to_dict(result)
    document_schema_client = contentwarehouse.DocumentSchemaServiceClient()

    # The full resource name of the location, e.g.:
    # projects/{project_number}/locations/{location}
    parent = document_schema_client.common_location_path(
        project=config['project_number'], location=config['location']
    )

    # Create a Document Service client
    document_client = contentwarehouse.DocumentServiceClient()

    # The full resource name of the location, e.g.:
    # projects/{project_number}/locations/{location}
    parent = document_client.common_location_path(
        project=config['project_number'], location=config['location']
    )
    #print(result.document._pb)
    docDictionary = result.document.__dict__
    doc=documentai.types.Document(docDictionary)
    # Define Document
    document = contentwarehouse.Document(
        # raw_document_file_type=1,
        display_name="60.pdf",
        document_schema_name=schema_URI,
        inline_raw_document=open('60.pdf','rb').read(),
        #plain_text=str(result.document)
        cloud_ai_document=doc
    )

    # Define Request
    create_document_request = contentwarehouse.CreateDocumentRequest(
        parent=parent, document=document
    )

    # Create a Document for the given schema
    response = document_client.create_document(request=create_document_request)
    # print(response)

process_document_sample(config['project_id'],config['location'],config['Custom_processor_id'],file_path,mime_type)


I have read all the documentation, but could not find out why the object does not accept the dictionary.


2 REPLIES

Hi @malimasood,

Welcome to Google Cloud Community.

It looks like the following line is causing the error:

 

doc = documentai.types.Document(docDictionary)

 

The docDictionary variable holds the Document object returned by the Document AI API in dictionary form, but it appears that the Document class constructor does not accept that dictionary as a parameter.

Instead, you should generate a new Document object from the dictionary representation using the from_dict class method of the Document class. Here's how to change your code to use from_dict:

 

doc = documentai.types.Document.from_dict(docDictionary)

 

After this change, the doc variable should hold a valid Document object that you can pass to the CreateDocumentRequest constructor.
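
As background on the ValueError in the question, here is a minimal sketch of why reading `__dict__` can produce the `_pb` key. It assumes the Document class follows the proto-plus wrapper pattern that the traceback points to (proto\message.py). `WrappedMessage` and its field names are hypothetical stand-ins for illustration, not the real Document AI API.

```python
# Stand-in for a proto-plus style message: NOT the real documentai.Document.
# It mimics one relevant detail: the data lives in a private `_pb`
# attribute, and the constructor rejects keys that are not declared fields.
class WrappedMessage:
    FIELDS = ("text", "mime_type")  # hypothetical field names

    def __init__(self, mapping=None, **kwargs):
        values = dict(mapping or {}, **kwargs)
        for key in values:
            if key not in self.FIELDS:
                raise ValueError(
                    "Unknown field for {}: {}".format(type(self).__name__, key)
                )
        # Store the payload privately, the way a wrapper class would.
        self.__dict__["_pb"] = values

    def __getattr__(self, name):
        # Only called when normal lookup fails, so `_pb` itself is found first.
        if name in self.FIELDS:
            return self.__dict__["_pb"].get(name, "")
        raise AttributeError(name)


msg = WrappedMessage(text="Invoice 60", mime_type="application/pdf")

# The instance dictionary exposes the wrapper's internals, not the fields.
internal = msg.__dict__
print(list(internal))  # ['_pb']

# Feeding that dictionary back to the constructor fails on the `_pb` key.
try:
    WrappedMessage(internal)
except ValueError as exc:
    error_text = str(exc)
    print(error_text)  # Unknown field for WrappedMessage: _pb
```

If the real class behaves the same way, `result.document.__dict__` is not a field dictionary at all, which would explain the "Unknown field for Document: _pb" message.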

Here is some documentation that might help:
https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
https://cloud.google.com/document-ai/docs/handle-response
https://cloud.google.com/discovery-engine/media/docs/documents


Thanks for the prompt response. I tried your solution, but it still throws an exception:



Traceback (most recent call last):

File "<ipython-input-1-d034c5106990>", line 1, in <module>
runfile('C:/Users/HP/Downloads/test simple.py', wdir='C:/Users/HP/Downloads')

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/HP/Downloads/test simple.py", line 110, in <module>
process_document_sample(config['project_id'],config['location'],config['Custom_processor_id'],file_path,mime_type)

File "C:/Users/HP/Downloads/test simple.py", line 88, in process_document_sample
doc=documentai.types.Document.from_dict(docDictionary)

AttributeError: type object 'Document' has no attribute 'from_dict'


Code Changes:
docDictionary = result.document.__dict__
doc=documentai.types.Document.from_dict(docDictionary)
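
The AttributeError above shows that the installed Document class has no from_dict helper, while the original snippet already calls Document.to_dict successfully. A minimal sketch of two possible ways forward follows. It uses a hypothetical stand-in class (`WrappedMessage`, with made-up field names) so that it runs without Google Cloud credentials. Mapping it onto the real library is an assumption: the analogous calls would presumably be `cloud_ai_document=result.document`, or `documentai.Document(documentai.Document.to_dict(result.document))`.

```python
class WrappedMessage:
    """Stand-in for a proto-plus style message (hypothetical, for illustration)."""

    FIELDS = ("text", "mime_type")  # hypothetical field names

    def __init__(self, mapping=None, **kwargs):
        values = dict(mapping or {}, **kwargs)
        unknown = [key for key in values if key not in self.FIELDS]
        if unknown:
            raise ValueError(
                "Unknown field for {}: {}".format(type(self).__name__, unknown[0])
            )
        self.__dict__["_pb"] = values

    @classmethod
    def to_dict(cls, instance):
        # Returns declared fields only, never the private `_pb` wrapper.
        return {name: instance.__dict__["_pb"].get(name, "") for name in cls.FIELDS}


original = WrappedMessage(text="Invoice 60", mime_type="application/pdf")

# Check which helpers exist before calling them.
print(hasattr(WrappedMessage, "from_dict"))  # False
print(hasattr(WrappedMessage, "to_dict"))    # True

# Option 1: reuse the message object as-is; no dictionary is needed.
doc = original

# Option 2: if a dictionary is required, build it with to_dict() and
# hand it back to the constructor.
as_dict = WrappedMessage.to_dict(original)
rebuilt = WrappedMessage(as_dict)
print(WrappedMessage.to_dict(rebuilt) == as_dict)  # True
```

Option 1 avoids any conversion, so it is probably the simpler thing to try first. Whether the to_dict round trip preserves every field of a real Document (for example bytes fields), and whether Document AI Warehouse accepts the result, is not verified here.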