Re: Extracting data from newspaper articles

tboyd03 · 07-06-2023 12:53 PM

I am working on a historical project and am hoping for some tips. We have about 2500 newspaper articles as PDFs from the last 100 years, and we would like to pull some basic information (article date, meeting speaker and topic, location, etc.) for analysis to show trends over the years. I was hoping to be able to at least do a rough pull using ML, but to be honest I am stuck trying to find my starting point. Any tips or guidance would be greatly appreciated! Identifying the useful data in unstructured newspaper articles is something I am hoping technology can do at this point.

jayita13

@tboyd03 try the document-qa examples in the GoogleCloudPlatform/generative-ai repository on GitHub (generative-ai/language/examples/document-qa, main branch).

tboyd03

Thanks Jayita13! One question I have is on document size. Most of these articles will be very short and mostly contain the data we need. I tried to review the files you linked but must admit I am rather lost at this stage. For example, some are just meeting announcements like "Charles Smith will speak to the Lions about traffic safety at their meeting to be held at the Sury Hotel." I'd like to be able to pull the speaker (Charles Smith), topic (Traffic Safety), location (Sury Hotel) and date (from the article PDF header). With close to 3k articles now, it would take forever to do this manually, and any automation we could achieve would be a great help. Other articles are longer and require more digging to get the data of the article, but even automating the short ones would be a big leap ahead.
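
To make the target concrete, here is a rough sketch of the kind of pull described above, using only a regular expression in Python. The pattern and the `extract_announcement` name are illustrative assumptions built around the single example sentence; real announcements will be worded differently, so this would only ever cover a narrow slice of the articles.

```python
import re

# Hypothetical pattern, written around one example sentence only.
ANNOUNCEMENT = re.compile(
    r"(?P<speaker>[A-Z][\w.]*(?: [A-Z][\w.]*)+) will speak to (?:the )?"
    r"(?P<audience>.+?) about (?P<topic>.+?) at .*?held at (?:the )?"
    r"(?P<location>[^.]+)"
)


def extract_announcement(text):
    """Return speaker, audience, topic and location, or None if the wording differs."""
    match = ANNOUNCEMENT.search(text)
    return match.groupdict() if match else None


sentence = (
    "Charles Smith will speak to the Lions about traffic safety "
    "at their meeting to be held at the Sury Hotel."
)
print(extract_announcement(sentence))
```

Anything that comes back as None would still need manual review or a smarter extractor.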

kvandres

Good day @tboyd03,

Welcome to Google Cloud Community!

If you want to extract text from the PDF files, you can use Document AI. The Document OCR processor can identify and extract the text in a document or PDF file, including handwritten text. If you want, you can try it by uploading a sample file here: https://cloud.google.com/document-ai/docs/drag-and-drop
You can also build and train your own custom document extractor suited to your PDF files. This is especially useful if the structure and layout of your documents are consistent, since it lets you extract specific sections (date, speaker, etc.). Based on the information you have provided, this would be a better option than plain OCR because you want specific fields rather than the full text, but please note that it requires training and test data. For more information on building a processor, see these links:
https://cloud.google.com/document-ai/docs/workbench/build-custom-processor
https://cloud.google.com/document-ai/docs/processors-list#processor_cde
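
As a minimal sketch of how a processor could be called from Python once it exists, assuming the `google-cloud-documentai` client library is installed and credentials are configured. The project, location and processor IDs are placeholders, and the field names (`date`, `speaker`, `topic`, `location`) are assumptions that would have to match whatever labels you define when training a custom extractor; the `entities_to_row` helper is hypothetical.

```python
def process_pdf(path, project_id, location, processor_id):
    """Send one PDF to a Document AI processor and return the parsed Document."""
    # Imported here so the helper below works without the client library installed.
    from google.cloud import documentai

    # Note: processors outside the "us" location may need a regional api_endpoint.
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(project_id, location, processor_id)
    with open(path, "rb") as f:
        raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")
    result = client.process_document(
        request=documentai.ProcessRequest(name=name, raw_document=raw)
    )
    # result.document.text holds the OCR text; .entities holds extracted fields.
    return result.document


def entities_to_row(document, fields=("date", "speaker", "topic", "location")):
    """Flatten extracted entities into one dict per article; missing fields stay None."""
    row = {field: None for field in fields}
    for entity in document.entities:
        # Keep the first mention of each expected field and ignore other labels.
        if entity.type_ in row and row[entity.type_] is None:
            row[entity.type_] = entity.mention_text
    return row
```

With an OCR-only processor, `document.entities` will generally be empty and you would work from `document.text` instead; one row per article can then be written to a CSV for the trend analysis.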

Hope this is useful!

tboyd03

Unfortunately, most of this is unstructured. I could pull the date automatically from the standard header of the articles, but the articles themselves are anything from short announcements like the one I listed above to full-page articles about an event.