Solved: AutoML Dataset preparation help

alexdruk · 04-03-2024 02:24 AM

I am new to GCP and machine learning. I would like to use GCP to classify the text versions of web pages into two categories: "good" and "bad". I extracted the text using Readability.js. I am confused about how to prepare a dataset. Hopefully somebody from this community can help me by answering several questions.

1. Is AutoML the best solution for this classification? Maybe I should choose something else?

2. As I understand it, the minimal JSON format for AutoML includes only label and textContent. But my data has more features, such as author, domain, URL, et cetera. Should they be included in the dataset? If yes, how?

3. Would it be better to preprocess the texts (delete punctuation, lowercase, omit stopwords, ...)?

4. Should the labels be words (good and bad), or would numbers (1 and 0) be enough?

Any help would be greatly appreciated!

lsolatorio

Hi @alexdruk,

Thank you for joining our community.

That's awesome that you're interested in using GCP, especially AutoML! It's a great tool for getting started with machine learning. I'm happy to share some insights and help you find answers to your questions.

1. AutoML Natural Language and AutoML Vision are no longer available as separate services, but their functionality is now part of Vertex AI. This means you can still build and use similar models with Vertex AI's AutoML tools.

2. The JSON format for AutoML allows including additional features besides text content. These features can potentially improve model performance if they're relevant to your classification task. Adding the author, domain and URL could be helpful features. You can add them as separate key-value pairs in your JSON along with "textContent" and "label." For example:

{
  "textContent": "This is a great website with informative content.",
  "label": "good",
  "author": "some name",
  "domain": "example.com",
  "url": "https://www.example.com"
}

See "Prepare text training data for classification" for more information.
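If your pages are already in memory after the Readability.js step, records shaped like the example above could be generated with a short script. This is only a minimal sketch: the `pages` list, the `to_record` and `to_jsonl` helpers, and the choice of one JSON object per line (JSON Lines) are my assumptions, so please check the exact field names the importer expects against the documentation page mentioned above.

```python
import json

# Hypothetical input: pages already extracted with Readability.js.
pages = [
    {
        "text": "This is a great website with informative content.",
        "label": "good",
        "author": "some name",
        "domain": "example.com",
        "url": "https://www.example.com",
    },
]


def to_record(page):
    """Build one record using the keys from the example in this post."""
    return {
        "textContent": page["text"],
        "label": page["label"],
        "author": page["author"],
        "domain": page["domain"],
        "url": page["url"],
    }


def to_jsonl(pages):
    """Serialize pages as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(to_record(p), ensure_ascii=False) for p in pages)


if __name__ == "__main__":
    # Print instead of writing a file so the sketch has no side effects.
    print(to_jsonl(pages))
```

Using `json.dumps` rather than building strings by hand takes care of escaping quotes and newlines inside the page text.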

3. Text preprocessing is a crucial step for text classification tasks. It helps improve the accuracy and efficiency of your model by making the text data more consistent and easier for the model to understand.

Lowercasing: Converts all text to lowercase letters. This removes case sensitivity, making "good" and "Good" equivalent.
Punctuation removal: Removes punctuation marks like commas, periods, and exclamation points. This can help the model focus on the core meaning of the words.
Stopword removal: Removes common words that don't carry much meaning, like "the," "a," "an," etc. This can be helpful, but be cautious! Sometimes, stopwords can be important depending on the context. For example, "not good" wouldn't be the same without "not."
Tokenization: Splits the text into individual words or meaningful phrases. This allows the model to analyze each word or phrase independently.
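The four steps above can be sketched in a few lines of plain Python. Treat this as an illustration under some assumptions rather than a recommended pipeline: the `preprocess` function and the tiny `STOPWORDS` set are made up for this example, tokenization is a simple whitespace split, and negations such as "not" are deliberately kept out of the stopword list for the reason given above.

```python
import string

# Tiny illustrative stopword list (an assumption for this sketch).
# Negations such as "not" are deliberately left out so "not good" survives.
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("The content is NOT good, and the layout is bad!"))
    # → ['content', 'not', 'good', 'layout', 'bad']
```

It is worth training once on the raw text and once on the preprocessed text and comparing the evaluation metrics, since the benefit depends on the data.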

4. Both word labels ("good" and "bad") and numeric labels (1 and 0) are acceptable for AutoML. Using words might be slightly more interpretable for humans, but using numbers is generally more efficient for machine learning algorithms.
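If you want the readability of words in your files and numbers in your own code, a small mapping lets you convert in both directions. This is just a sketch; the names `LABEL_TO_ID`, `encode`, and `decode` are hypothetical and not part of any GCP API.

```python
# Hypothetical mapping between word labels and numeric labels.
LABEL_TO_ID = {"bad": 0, "good": 1}
ID_TO_LABEL = {i: name for name, i in LABEL_TO_ID.items()}


def encode(labels):
    """Convert word labels to numbers."""
    return [LABEL_TO_ID[label] for label in labels]


def decode(ids):
    """Convert numeric labels back to words."""
    return [ID_TO_LABEL[i] for i in ids]


if __name__ == "__main__":
    print(encode(["good", "bad", "good"]))  # → [1, 0, 1]
    print(decode([1, 0, 1]))                # → ['good', 'bad', 'good']
```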

I hope I was able to provide you with useful insights.
