AutoML engine corruption

We are training AutoML models for 10 languages using a consistent set of training files. However, we've observed that the trained models for certain languages corrupt some of our test files; the same files pass through the models for other languages without issue.

Although training completes successfully and the models become available for use, some models display the following error when I open them in the console:

[Screenshot: the model's console page showing a "Task missed its deadline" error]

Unfortunately, the error message is not very descriptive, which makes troubleshooting difficult. However, I have noticed a possible correlation between this error and the corrupt outputs generated during testing.

We have integrated the trained models with SDL WorldServer via the API for testing. The issue is reproducible across multiple languages and affects both .docx and .xml file formats.

Examples of input source and output issues:

  • Input: {123} -> Output: ID 123
  • Input: {1}{2}{3} -> Output: {1}{2}{2}{3}
  • Input: https://website.com -> Output: random text or "link"
  • Input: segment text -> Output: +
    (Note: {n} represents placeholder text.)

Has anyone else encountered similar issues or knows how to resolve this? Any insights would be greatly appreciated.

Thanks!

Hi @rubric-engineer,

Welcome to Google Cloud Community!

The "Task missed its deadline" error in Google Cloud AutoML, coupled with the inconsistent and seemingly corrupted outputs, points towards a problem with either the model's performance or the API interaction, rather than inherent corruption within AutoML itself. The fact that it's language-specific strongly suggests a data or model training issue.

The most likely cause of the model failures is data-related: insufficient training data for some languages, poor data quality (inconsistencies, errors, imbalances), or potential data leakage. A thorough review of the training data's volume, consistency, and cleanliness for each language is crucial.

The problem could also stem from model performance issues caused by insufficient training time or resources, a suboptimal model architecture or hyperparameters, or inappropriate evaluation metrics. Experiment with longer training times, different model configurations, and a wider range of evaluation metrics to improve model performance.

The error might also stem from API integration problems. Check for API rate limits, verify the correctness of API requests and timeout settings, and ensure proper configuration and error handling within the SDL WorldServer integration.
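To make the timeout and error-handling side concrete, here is a minimal sketch of calling a trained AutoML model through the Cloud Translation v3 API with an explicit client-side timeout and request/response logging. The project ID, location, model ID, and language codes are placeholders, not values from your setup:

```python
# Minimal sketch: call a trained AutoML model via the Cloud Translation v3 API
# with an explicit timeout and error handling. The project, location, and model
# ID below are placeholders -- substitute your own values.
import logging

from google.api_core import exceptions
from google.cloud import translate_v3 as translate

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("automl-translate")

PROJECT_ID = "your-project-id"  # placeholder
LOCATION = "us-central1"        # placeholder
MODEL_ID = "your-model-id"      # placeholder

client = translate.TranslationServiceClient()
parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"
model = f"{parent}/models/{MODEL_ID}"

def translate_segment(text: str, source: str, target: str) -> str | None:
    request = {
        "parent": parent,
        "contents": [text],
        "mime_type": "text/plain",
        "source_language_code": source,
        "target_language_code": target,
        "model": model,
    }
    logger.info("Request: %r", request)
    try:
        # An explicit timeout surfaces deadline problems on the client side
        # instead of leaving them buried inside the WorldServer integration.
        response = client.translate_text(request=request, timeout=60.0)
    except exceptions.DeadlineExceeded:
        logger.error("Deadline exceeded for segment: %r", text)
        return None
    except exceptions.GoogleAPICallError as err:
        logger.error("API error for segment %r: %s", text, err)
        return None
    translated = response.translations[0].translated_text
    logger.info("Response: %r", translated)
    return translated

print(translate_segment("{1}Hello{2} world{3}", "en", "de"))
```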

Here are some troubleshooting steps you can try:

  1. Analyze Data: Start by thoroughly examining the training data for each language, paying close attention to data volume, quality, and consistency (a quick data-audit sketch follows this list).
  2. Reduce the Problem: Try training a smaller model on a subset of your data (for a single problematic language) to see if you can pinpoint the issue.
  3. Monitor API Calls: Log all API requests and responses to identify potential errors or delays (the API sketch above illustrates one way to do this).
  4. Experiment with Model Parameters: Try adjusting the training parameters (e.g., increasing training time, changing the model architecture) for the problematic languages.
  5. Check for Errors in SDL WorldServer Logs: Look for any error messages or warnings in the logs of your SDL WorldServer installation.
  6. Contact Google Cloud Support: If you've exhausted the other options, Google Cloud Support may be able to provide more insight into the "Task missed its deadline" error, since they have access to logs and metrics that you don't. I also suggest filing a defect report; that way you can track the progress of your request, as it is publicly visible. Please note that I can't provide any details or timelines at this moment. For future updates, keep an eye on the issue tracker.
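
For step 1, a lightweight audit of the training pairs often surfaces these problems quickly. Below is a sketch that assumes your training data is in the tab-separated source/target format AutoML Translation accepts and that placeholders follow the {n} convention from your examples; the file path is illustrative:

```python
# Hedged sketch: audit an AutoML Translation training TSV (source<TAB>target
# per line) for common data-quality problems. The file path and the {n}
# placeholder pattern are assumptions based on this thread -- adjust them to
# your actual data.
import csv
import re
from collections import Counter

PLACEHOLDER = re.compile(r"\{\d+\}")

def audit_tsv(path: str) -> None:
    rows = empty = mismatched = 0
    sources = Counter()
    with open(path, encoding="utf-8", newline="") as fh:
        for i, row in enumerate(csv.reader(fh, delimiter="\t"), start=1):
            if len(row) != 2 or not row[0].strip() or not row[1].strip():
                empty += 1
                continue
            rows += 1
            source, target = row
            sources[source] += 1
            # Placeholder sets should match exactly between source and target;
            # a mismatch in the training data can teach the model to drop or
            # duplicate {n} tokens, as seen in the corrupted outputs.
            if Counter(PLACEHOLDER.findall(source)) != Counter(PLACEHOLDER.findall(target)):
                mismatched += 1
                print(f"line {i}: placeholder mismatch: {source!r} -> {target!r}")
    duplicates = sum(c - 1 for c in sources.values() if c > 1)
    print(f"{rows} usable pairs, {empty} empty/malformed rows, "
          f"{mismatched} placeholder mismatches, {duplicates} duplicate sources")

audit_tsv("train_en_de.tsv")  # illustrative path
```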

The inconsistent outputs (e.g., {1}{2}{3} becoming {1}{2}{2}{3}) suggest a problem with how the model is handling sequences or patterns in the input text, reinforcing the need for a closer look at the training data and model architecture. The "random text" output for URLs implies the model is not correctly recognizing or classifying this type of input. The key to resolving this is likely a detailed examination of your data.
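
As a practical safeguard while you investigate, you can validate each translated segment against its source before WorldServer writes it into the .docx/.xml output, flagging segments where {n} placeholders were duplicated or dropped, or where a URL was rewritten. A minimal sketch, assuming the same {n} and URL conventions as in your examples:

```python
# Minimal sketch: flag translated segments whose {n} placeholders or URLs do
# not survive translation intact. The placeholder/URL patterns are assumptions
# based on the examples in this thread.
import re
from collections import Counter

PLACEHOLDER = re.compile(r"\{\d+\}")
URL = re.compile(r"https?://\S+")

def output_is_consistent(source: str, output: str) -> bool:
    # {1}{2}{3} -> {1}{2}{2}{3} fails here: the placeholder counts differ.
    if Counter(PLACEHOLDER.findall(source)) != Counter(PLACEHOLDER.findall(output)):
        return False
    # URLs should pass through untouched rather than become "random text".
    if sorted(URL.findall(source)) != sorted(URL.findall(output)):
        return False
    return True

assert not output_is_consistent("{1}{2}{3}", "{1}{2}{2}{3}")
assert not output_is_consistent("see https://website.com", "see link")
assert output_is_consistent("{1}Hello{2}", "{1}Hallo{2}")
```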

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.