I am new to GCP and machine learning. I would like to use GCP to classify the text versions of web pages into two categories: "good" and "bad". I extracted the text using Readability.js. I am confused about how to prepare a dataset. Hopefully somebody from this community can help me by answering several questions.
1. Is AutoML the best solution for this classification? Maybe I should choose something else?
2. As I understand it, the minimal JSON format for AutoML includes only label and textContent. But my data has more features, like author, domain, url, et cetera. Should they be included in the dataset? If yes, how?
3. Would it be better to preprocess the texts (delete punctuation, lowercase, omit stopwords, ...)?
4. Should the labels be words (good and bad), or would numbers (1 and 0) be enough?
Any help would be greatly appreciated!
Hi @alexdruk,
Thank you for joining our community.
That's awesome that you're interested in using GCP, especially AutoML! It's a great tool for getting started with machine learning. I'm happy to share some insights and help you find answers to your questions.
1. AutoML Natural Language and AutoML Vision are no longer available as separate services, but their functionality is now part of Vertex AI. This means you can still build and use similar models with Vertex AI's AutoML tools.
2. The JSON format for AutoML allows including additional features besides the text content. These features can potentially improve model performance if they're relevant to your classification task. The author, domain, and URL could be helpful features. You can add them as separate key-value pairs in your JSON alongside "textContent" and "label". For example:
{
"textContent": "This is a great website with informative content.",
"label": "good",
"author": "some name",
"domain": "example.com",
"url": "https://www.example.com"
}
See "Prepare text training data for classification" for more information.
3. Yes, text preprocessing is a crucial step for text classification tasks. Steps like the ones you list (deleting punctuation, lowercasing, omitting stopwords) make the text data more consistent and easier for the model to understand, which helps improve the accuracy and efficiency of your model.
4. Both labels "good" and "bad" (words) and numbers (1 and 0) are acceptable for AutoML. Using words might be slightly more interpretable for humans, but using numbers is generally more efficient for machine learning algorithms.
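To make points 2 to 4 concrete, here is a minimal Python sketch, not an official recipe. It lowercases the text, strips punctuation, drops stopwords, and serializes each record as one JSON object per line using the same keys as the example above. The `STOPWORDS` set, the `LABEL_TO_ID` mapping, the `preprocess` and `to_jsonl_line` helpers, and the sample record are all hypothetical names chosen for illustration. The exact import schema that Vertex AI expects is an assumption here, so please check it against the documentation before uploading.

```python
import json
import string

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "the", "is", "this", "with", "and", "of", "to"}

# Hypothetical mapping if you prefer numeric labels over words.
LABEL_TO_ID = {"bad": 0, "good": 1}


def preprocess(text: str) -> str:
    """Lowercase the text, remove ASCII punctuation, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


def to_jsonl_line(record: dict, clean: bool = True) -> str:
    """Serialize one record as a single JSON Lines row.

    The keys mirror the example in this answer; they are an assumption,
    not a confirmed Vertex AI schema.
    """
    row = dict(record)  # copy so the caller's record is not modified
    if clean:
        row["textContent"] = preprocess(row["textContent"])
    return json.dumps(row, ensure_ascii=False)


if __name__ == "__main__":
    sample = {
        "textContent": "This is a great website with informative content.",
        "label": "good",
        "author": "some name",
        "domain": "example.com",
        "url": "https://www.example.com",
    }
    print(to_jsonl_line(sample))
    print(LABEL_TO_ID[sample["label"]])
```

Whether this kind of cleaning actually helps depends on your data, so it is worth comparing a model trained on the raw text against one trained on the cleaned text.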
I hope I was able to provide you with useful insights.