Solved: Empty pages array in Google Document AI API OCR response

rajarehan · 04-28-2023 05:49 AM

I'm currently using the Google Document AI API to extract text from PDFs using OCR. However, I've noticed that the pages array in the OCR response is always empty, even though the OCR operation completes successfully and I'm able to retrieve text from the document.

Here's a simplified version of the code I'm using:

from google.api_core.exceptions import InternalServerError, RetryError
from google.cloud import documentai_v1beta3 as documentai

@classmethod
def extract_text(cls, book_link: str):
    """Extract text from book using OCR"""

    # Upload the book to GCS
    filename = cls._upload_file_to_gcs(book_link=book_link)

    # Create the Batch Process Request
    gcs_input_uri = f"gs://{BUCKET}/input/{filename}"
    operation = cls._create_batch_process_request(gcs_input_uri=gcs_input_uri)

    # Wait for the operation to finish
    try:
        operation.result(timeout=300)
    # Catch exception when operation doesn't finish before timeout
    except (RetryError, InternalServerError) as e:
        # Pass the message itself; {e.message} would build a one-element set
        raise exceptions.APIException(
            detail=e.message
        )

    metadata = documentai.BatchProcessMetadata(operation.metadata)

    if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
        # Pass the message itself rather than wrapping it in a set literal
        raise exceptions.APIException(
            detail=metadata.state_message
        )

    output_documents = cls._get_output_documents(metadata=metadata)

    # Delete the input file from GCS
    cls.gcs_bookmapping_bucket.delete_blob(blob_name=f"input/{filename}")

    # Extract text from the output documents
    book_text = []
    for document in output_documents:
        for page in document.pages: # **here document.pages is always empty**
            book_text.append(
                cls._layout_to_text(layout=page.layout, text=document.text)
            )


    return book_text

The document.text attribute contains the text of the entire document, but the pages array is always empty. This is preventing me from extracting text on a per-page basis, which is something I need for my application.

I've double-checked the input PDF files to ensure that they have multiple pages, so I'm confident that the issue is not with the input data.

I'm using documentai_v1beta3. I've also tried documentai_v1, but it still didn't work.

Has anyone else experienced this issue with the Google Document AI API? Any suggestions for how I can retrieve text on a per-page basis?

Thanks in advance for your help.

kvandres

Good day @rajarehan,

Welcome to Google Cloud Community!

One possible reason for the issue is that the Document AI API was unable to identify the page boundaries in your PDF files, which could leave the pages array empty. This may occur if the page borders in the PDF files are not clearly defined or if the OCR engine fails to recognize the page limits.

One way to extract text on a per-page basis is to use the layout data provided in the OCR response. The layout information for each detected text block includes its bounding box and a text anchor pointing to the block's position in the document text.

Here is a possible snippet that may extract text per page based on the layout details:


    # Extract text from the output documents
    book_text = []
    for document in output_documents:
        for page in document.pages:
            page_text = ""
            # page.blocks only holds the blocks detected on this page
            for block in page.blocks:
                # The block's text anchor stores offsets into document.text
                for segment in block.layout.text_anchor.text_segments:
                    start = int(segment.start_index)
                    end = int(segment.end_index)
                    # Append the block's slice of the text to the page text
                    page_text += document.text[start:end]
            book_text.append(page_text)

    return book_text

This version of the code loops through each block on each page and uses the block's text anchor to slice the matching text out of `document.text`. The slices are concatenated into the page text, which is then appended to the `book_text` list.

Hope it helps!

rajarehan

Thanks. I noticed that my field mask was "text". I changed it to "text,pages.layout" and it worked.
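
For anyone landing here later, a toy illustration of why the mask mattered. This is not the Document AI implementation, just a hypothetical `apply_field_mask` helper over plain dicts, assuming a field mask is a comma-separated list of dotted paths and that everything outside those paths is dropped from the output. The sample document is made up.

```python
from __future__ import annotations

from typing import Any


def apply_field_mask(document: dict[str, Any], mask: str) -> dict[str, Any]:
    """Toy filter: keep only the comma-separated dotted paths named in `mask`."""
    result: dict[str, Any] = {}
    for path in mask.split(","):
        _copy_path(document, result, path.strip().split("."))
    return result


def _copy_path(src: Any, dst: dict[str, Any], keys: list[str]) -> None:
    key, rest = keys[0], keys[1:]
    if not isinstance(src, dict) or key not in src:
        return
    value = src[key]
    if not rest:
        # End of the path: keep the whole field.
        dst[key] = value
    elif isinstance(value, list):
        # Repeated field: apply the rest of the path to every element.
        items = dst.setdefault(key, [{} for _ in value])
        for item_src, item_dst in zip(value, items):
            _copy_path(item_src, item_dst, rest)
    elif isinstance(value, dict):
        _copy_path(value, dst.setdefault(key, {}), rest)


# Made-up document with two pages.
doc = {
    "text": "Page one text.\nPage two text.\n",
    "pages": [
        {"pageNumber": 1, "layout": {"textAnchor": {"textSegments": [{"endIndex": 15}]}}},
        {"pageNumber": 2, "layout": {"textAnchor": {"textSegments": [{"startIndex": 15, "endIndex": 30}]}}},
    ],
}

text_only = apply_field_mask(doc, "text")
with_layout = apply_field_mask(doc, "text,pages.layout")
```

With the mask `"text"`, the filtered document has no `pages` key at all, which matches the empty array described in the question. With `"text,pages.layout"`, each page keeps its layout, and that is enough for slicing `document.text` per page.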
