Extract data from previous page

Scribble932 · 08-02-2023 08:02 AM

Hello community!

I'm trying to create a custom processor for a particular type of bill of lading we receive from a supplier.

Each document (pdf) contains several pages which are just the scans of the original bill of lading.

Each bill of lading contains the following data:

Package number: there can be several packages in each shipment. Each of them can contain one or more products
Product Code
Quantity
Product Description

which, visually, are structured as follows:

Package number: 123456
- Prod cod 1 | Qty | Description
- Prod cod 2 | Qty | Description
- Prod cod 3 | Qty | Description
- Prod cod 4 | Qty | Description

etc.

The rows can span several pages. Here's an example:

--- PAGE 1 ---

Package number: 123456
- Prod cod 1 | Qty | Description
- Prod cod 2 | Qty | Description

--- END OF PAGE 1 ---

--- PAGE 2 ---

- Prod cod 3 | Qty | Description
- Prod cod 4 | Qty | Description

Package number: 78910
- Prod cod 5 | Qty | Description

--- END OF PAGE 2 ---

As you can see in the example, the Package number for "Prod. cod 3" and "Prod. cod 4" (which is "123456") is not present on page 2.

So my question is: how can I tell the processor something like "if package number is not present above the product code, then take the latest Package Number from the previous row OR, in case the Package Number is not present in the previous row, check the last package number of the previous page"?

I hope the question is clear enough. Unfortunately, I can't provide the original document.

Thank you!

kvandres

Good day @Scribble932,

Welcome to Google Cloud Community!

You may be able to achieve this by applying the logic when you handle the processing response of the custom extractor. Loop over the pages in order and check whether a package number was extracted on each page. Whenever one is found, store it in a variable. Since, as you've mentioned, the first page always contains a package number, the variable will always hold a value. If the processor was not able to extract a package number for a product row, take the value from that variable. When a new package number is encountered (or another document starts being processed), the variable is updated with the new value. You can keep track of the page number using pages[].pageNumber. You can check this link for more information: https://cloud.google.com/document-ai/docs/handle-response#documentai_process_ocr_document-python
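As a rough illustration, the carry-forward logic described above could look like the Python sketch below, applied to the package number from the question. It assumes the extracted items have already been pulled out of the processor response into a simple list in reading order. The names Row, Product and assign_package_numbers are hypothetical and not part of any Document AI API, and how you build the list from the response depends on your processor schema and is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Row:
    """One extracted item, in reading order (hypothetical structure)."""
    page: int    # 1-based page number the item was found on
    kind: str    # "package_number" or "product"
    value: str   # the package number, or the product code


@dataclass
class Product:
    package_number: Optional[str]  # None if no package number was seen yet
    product_code: str
    page: int


def assign_package_numbers(rows: Iterable[Row]) -> List[Product]:
    """Attach to each product the most recent package number seen so far."""
    current_package: Optional[str] = None
    products: List[Product] = []
    # sorted() is stable, so the reading order within a page is preserved.
    for row in sorted(rows, key=lambda r: r.page):
        if row.kind == "package_number":
            # A new package starts here: remember it for the rows that follow,
            # including rows that continue on later pages.
            current_package = row.value
        elif row.kind == "product":
            products.append(Product(current_package, row.value, row.page))
    return products


# The two-page example from the question.
rows = [
    Row(1, "package_number", "123456"),
    Row(1, "product", "Prod cod 1"),
    Row(1, "product", "Prod cod 2"),
    Row(2, "product", "Prod cod 3"),
    Row(2, "product", "Prod cod 4"),
    Row(2, "package_number", "78910"),
    Row(2, "product", "Prod cod 5"),
]

result = assign_package_numbers(rows)
for product in result:
    print(product.page, product.package_number, product.product_code)
```

Because the variable is only overwritten when a new package number appears, the products at the top of page 2 inherit the last package number from page 1 without any special handling of page breaks.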

Hope this helps!

Scribble932

Hello kvandres,
Thank you for the answer, it sounds like a solution for me.

One question: is there a way to create variables from the "Train" section of the Google Cloud Platform Console?

I mean, from here:

Thank you