Multi-page PDF data extraction & model based training

SnowCub88 · 04-03-2024 08:56 PM

Hi all,

I am trying to develop a data extraction model for multi-page PDF files. The PDFs range from 5 to 20 pages and are all scientific papers in various formats. Ideally, I would like to train an AI model to extract the relevant labels/fields across the entire document (e.g. all 15 pages). During training, however, it appears that each new page is treated as a "new" document.

Ideally, the model would extract the article title once for the whole PDF, find a single DOI, and assign the first and final author once, instead of trying to locate each of these on every page of the PDF.

I have used 50+ PDFs for training and 20+ PDFs as test documents, but I am concerned that the model will not process each PDF as a single file and will instead process each page separately, if that makes sense.
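
One way to picture the desired behaviour is a post-processing step that merges page-level predictions into one record per PDF. The sketch below is only an illustration under stated assumptions: the prediction layout (the `field`, `value`, `confidence`, and `page` keys), the field names, and the `consolidate` helper are all hypothetical and do not come from any Google Cloud API, and the sample values are made up.

```python
# Hypothetical page-level predictions, one dict per extracted field.
# The keys and field names are illustrative, not a real API schema.
page_predictions = [
    {"field": "article_title", "value": "A Study of X", "confidence": 0.97, "page": 1},
    {"field": "article_title", "value": "Study of X", "confidence": 0.41, "page": 2},
    {"field": "doi", "value": "10.1234/example", "confidence": 0.88, "page": 1},
    {"field": "first_author", "value": "A. Author", "confidence": 0.90, "page": 1},
    {"field": "final_author", "value": "Z. Author", "confidence": 0.85, "page": 1},
    {"field": "final_author", "value": "Q. Cited", "confidence": 0.30, "page": 14},
]

SINGLE_VALUE_FIELDS = {"article_title", "doi", "first_author", "final_author"}


def consolidate(predictions, single_value_fields):
    """Keep only the highest-confidence value per field across all pages."""
    best = {}
    for pred in predictions:
        name = pred["field"]
        if name not in single_value_fields:
            continue  # ignore fields that may legitimately repeat
        if name not in best or pred["confidence"] > best[name]["confidence"]:
            best[name] = pred
    return {name: pred["value"] for name, pred in best.items()}


document_record = consolidate(page_predictions, SINGLE_VALUE_FIELDS)
print(document_record)
```

Picking the highest-confidence candidate is just one possible rule; preferring the earliest page for the title and DOI would be an equally reasonable assumption for scientific papers.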

Apologies if this is a rather simple issue - I am very much new to Google Cloud AI and cannot seem to locate any previous questions with a similar problem.

Many thanks

HenArevalo98

I have the same problem.

Were you able to resolve the issue?
