We have a large number of documents; by large, I mean the document count is in the tens of millions. The majority of these documents are ASPX pages. However, there are other formats too, such as Microsoft Office docx, xlsx, etc. We need to delete documents that are really old and have low value. Given the volume of data, I was wondering if AI could help here, e.g. by separating junk docs from actual real data. A junk doc is something that contains 'testing testing' or 'Lorem ipsum'.
Maybe an AI service could place a score on each document: 0 for very low value (contains Lorem ipsum) and 1 for high value (contains certain keywords such as customer names, etc.). Could anyone please give some pointers about such a service?
Certainly, you can use natural language processing (NLP) techniques and machine learning models to separate valuable documents from junk in a dataset of this size.
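Before reaching for a trained model, it may be worth trying a simple rule-based scorer along the lines of what you described. Below is a minimal sketch in Python, assuming you have already extracted plain text from the ASPX and Office files; the junk markers and high-value keywords are placeholder assumptions you would replace with terms from your own corpus (e.g. real customer names).

```python
# Minimal rule-based document scorer (sketch).
# Assumes plain text has already been extracted from ASPX/Office documents.

JUNK_MARKERS = ["lorem ipsum", "testing testing"]      # signals of a throwaway page
VALUE_KEYWORDS = ["contoso", "invoice", "contract"]    # hypothetical high-value terms


def score_document(text: str) -> float:
    """Return a value score between 0.0 (junk) and 1.0 (high value)."""
    lowered = text.lower()

    # Any junk marker immediately flags the document as very low value.
    if any(marker in lowered for marker in JUNK_MARKERS):
        return 0.0

    # Otherwise, score by the fraction of high-value keywords present.
    hits = sum(1 for kw in VALUE_KEYWORDS if kw in lowered)
    return hits / len(VALUE_KEYWORDS)


if __name__ == "__main__":
    print(score_document("Lorem ipsum dolor sit amet"))              # 0.0
    print(score_document("Invoice for Contoso, contract attached"))  # 1.0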
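```

Once this kind of heuristic stops being good enough, the same labeled examples you accumulate from it can be used to train a text-classification model that assigns the 0-to-1 score you described.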
Thank you, I appreciate your response. Would you be able to share some reading or video resources about NLP and model training? I could google, but if you know of some good ones, that would be helpful.