Vertex AI AutoML tabular null numeric values

I have a training dataset with some numeric columns that are legitimately nullable. E.g., think of a propensity model for employee churn where we have a feature for days_until_next_leave. The feature is empty for employees who have no leave booked, but it is populated for many others.

How should I treat this in Vertex AI AutoML? It seems rows with null numeric values are basically excluded? I would have expected "AutoML" to handle nulls automatically, but apparently that's not the case.

I can impute the values myself, but since Google keeps the AutoML algorithm secret I have no idea what an appropriate imputation value would be (since it depends on the algorithm family).

Hi @Jwaugh,

Welcome to Google Cloud Community!

When using Vertex AI AutoML for your propensity model, it's important to manage missing values properly. Here’s a short guide on how to handle null values:

1. Handling Null Values in Vertex AI AutoML:

  • Allow Invalid Values Setting: Vertex AI AutoML can keep rows that contain missing or invalid values if you turn on the "allow invalid values" option. The option is set per column, so enable it for every column that contains nulls. You can find it in the settings shown when you configure model training.
  • Imputation: You can also handle null values yourself by filling them in before you upload your data. Common ways to fill in missing values include:
    • Mean Imputation: Replace null values with the average of the existing values in that column.
    • Median Imputation: Replace null values with the middle value, which is less affected by outliers.
    • Mode Imputation: For categorical data, you can use the most frequent value.
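
As a rough illustration of the imputation options above, here is a minimal sketch of median imputation using only the Python standard library. It also adds a 0/1 "was missing" flag, so the fact that an employee has no leave booked is not lost when the gap is filled. The helper name impute_with_indicator, the _is_missing suffix, and the sample rows are made up for this example; they are not part of Vertex AI.

```python
from statistics import median

def impute_with_indicator(rows, column, fill=None):
    """Return copies of rows with nulls in `column` filled and a 0/1 flag column added."""
    # Use the median of the observed (non-null) values unless a fill value is given.
    observed = [r[column] for r in rows if r[column] is not None]
    if fill is None:
        fill = median(observed)
    flag = f"{column}_is_missing"  # hypothetical naming convention
    out = []
    for r in rows:
        missing = r[column] is None
        new = dict(r)  # copy so the input rows are left untouched
        new[flag] = 1 if missing else 0
        new[column] = fill if missing else r[column]
        out.append(new)
    return out

# Made-up sample data: employee 2 has no leave booked.
employees = [
    {"employee_id": 1, "days_until_next_leave": 12.0},
    {"employee_id": 2, "days_until_next_leave": None},
    {"employee_id": 3, "days_until_next_leave": 40.0},
    {"employee_id": 4, "days_until_next_leave": 5.0},
]
imputed = impute_with_indicator(employees, "days_until_next_leave")
```

Passing fill explicitly lets you reuse a value computed on the training data when you prepare rows for prediction later.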

2. Preparing Your Data:

  • Check Data Types and Formats: Make sure all values are formatted correctly. For dates, use the format yyyy-mm-dd. For numeric columns, use one consistent representation for every value (e.g., 0.0 rather than a mix of 0 and 0.0).
  • Standardize Numeric Columns: Ensure that all numeric columns have the same format. For example, converting all integers to floats can help prevent issues during training.
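
A small sketch of the standardization step, assuming the raw column arrives as a mix of strings, blanks, and placeholder text. The helper to_float_or_none and the sample values are hypothetical. Empty or unparseable cells become None so they can be treated as nulls afterwards.

```python
def to_float_or_none(value):
    """Parse a raw cell into a float; empty or unparseable cells become None."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "":
        return None
    try:
        return float(text)
    except ValueError:
        # Placeholder text such as "n/a" is treated as a missing value.
        return None

# Made-up raw values with mixed formatting.
raw_column = ["3", "4.5", "", " 10 ", "n/a", None]
cleaned = [to_float_or_none(v) for v in raw_column]
```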

3. Training Your Model:

  • Evaluate Performance: After training, check how well your model performs. If you have a separate validation set, test how the model does with and without imputed values to see if your imputation method is effective.
  • Cross-Validation: Use cross-validation to check that your imputation method does not distort the model’s measured performance.
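
If you compare imputation choices yourself, outside of AutoML, one thing to watch is that the fill value is computed from the training portion of each split only, so the validation rows do not leak into it. A minimal sketch under that assumption, using only the standard library; kfold_imputed and the sample values are made up for illustration:

```python
from statistics import median

def kfold_imputed(values, k):
    """Yield (train, validation) lists with nulls filled using the training-fold median only."""
    # Simple interleaved split: element i goes to fold i % k.
    folds = [values[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        # The fill value never sees the validation fold.
        fill = median(v for v in train if v is not None)
        yield (
            [fill if v is None else v for v in train],
            [fill if v is None else v for v in valid],
        )

# Made-up feature values with two nulls.
values = [1.0, None, 3.0, 100.0, None, 5.0]
splits = list(kfold_imputed(values, k=3))
```

Each split can then be scored with whatever model you use for the comparison; the point is only that the median is recomputed per split.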

For better AutoML models, handle null values and prepare data thoroughly. Experiment with imputation and data prep techniques to optimize results.

I hope the above information is helpful.

Honestly it is not very helpful to say "experiment for the best results" in an "autoML" product which is supposed to be plug and play.