
AI to calculate content value score, 0 for low value test content and 1 for high value content

We have a very large number of documents, in the tens of millions. The majority are ASPX pages, but there are other formats too, such as Microsoft Office docx and xlsx files. We need to delete documents that are really old and have low value. Given the volume of data, I was wondering if AI could help here, e.g. by separating junk docs from real data. A junk doc is something that contains 'testing testing' or 'Lorem ipsum'.

Maybe an AI service could place a score on each document: 0 for very low value (contains Lorem ipsum) and 1 for high value (contains certain keywords such as customer names). Could anyone please give any pointers to such a service?

2 REPLIES

Certainly, you can use natural language processing (NLP) techniques and machine learning models to help identify valuable documents in your large dataset.

  • Utilize an NLP model to analyze the text content of your documents. Train a model to identify patterns associated with junk documents, such as the presence of phrases like 'testing testing' or 'Lorem ipsum'.
  • Train another model to identify keywords or entities that indicate high-value content. This could include customer names, specific terms related to your business, or any other relevant keywords.
  • Consider using machine learning frameworks like TensorFlow, PyTorch, or scikit-learn to build and train your models.
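
Before training anything, a rule-based baseline can be a useful first pass. The sketch below is not a trained model: it is a plain keyword heuristic in Python that returns 0 for text containing the junk phrases from the question and moves the score toward 1 as high-value keywords appear. The phrase lists, the 0.5 neutral score, and the 0.25 weight per keyword are all assumptions for illustration, and text extraction from ASPX or Office files is assumed to happen elsewhere.

```python
import re

# Hypothetical phrase lists: replace with terms from your own corpus.
JUNK_PATTERNS = [r"lorem ipsum", r"\btesting testing\b"]
HIGH_VALUE_KEYWORDS = ["acme corp", "invoice", "contract"]


def value_score(text: str) -> float:
    """Return a score in [0, 1]: 0 = likely junk, 1 = likely high value."""
    lowered = text.lower()

    # Empty or whitespace-only documents carry no value.
    if not lowered.strip():
        return 0.0

    # Any junk phrase forces the score to 0, even if keywords are present.
    if any(re.search(pattern, lowered) for pattern in JUNK_PATTERNS):
        return 0.0

    # Start from a neutral 0.5 and add an assumed 0.25 per distinct
    # keyword found, capped at 1.0.
    hits = sum(1 for keyword in HIGH_VALUE_KEYWORDS if keyword in lowered)
    return min(1.0, 0.5 + 0.25 * hits)


if __name__ == "__main__":
    samples = [
        "Lorem ipsum dolor sit amet",
        "Meeting notes from Tuesday",
        "Signed contract and invoice for Acme Corp",
    ]
    for sample in samples:
        print(f"{value_score(sample):.2f}  {sample}")
```

Scores from a heuristic like this could also serve as rough labels for a sample of documents, which a classifier built with scikit-learn or a similar framework could then learn from.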

Thank you, I appreciate your response. Would you be able to share some reading or video resources about NLP and model training? I could Google it, but if you know of some good ones, that would be helpful.