
ML.DETECT_ANOMALIES returns null values when using data from a table other than the model's training table

Hello, 

I am trying to run the following queries:

-- Train an ARIMA_PLUS time-series model on the 2016 demand data
CREATE OR REPLACE MODEL `energy_generation.data_arima_model_ND`
OPTIONS(
  MODEL_TYPE="ARIMA_PLUS",
  TIME_SERIES_TIMESTAMP_COL="TIMESTAMP",
  TIME_SERIES_DATA_COL="ND"
) AS
SELECT TIMESTAMP, ND FROM `energy_generation.historic_demand_2016`;

-- Detect anomalies, passing in the same table the model was trained on
SELECT * FROM ML.DETECT_ANOMALIES(
  MODEL `energy_generation.data_arima_model_ND`,
  STRUCT(0.9 AS anomaly_prob_threshold),
  (
    SELECT * FROM `energy_generation.historic_demand_2016`
  )
);

The values I am getting back are all NULL: is_anomaly, lower_bound, upper_bound, and anomaly_probability are NULL for every row.

For context, I will eventually point ML.DETECT_ANOMALIES at a separate table, which is why I am testing with the table specified explicitly rather than just running the function on the model alone.

I have tried adding `TIME_SERIES_ID_COL`, but I am getting the same issue.

I am not sure what I am missing and would appreciate any help!


3 REPLIES

Here are potential causes and solutions for the null values issue in ML.DETECT_ANOMALIES:

  1. Data Inconsistency:

    • Verify data match: Ensure the data used in ML.DETECT_ANOMALIES matches the training data in column names, data types, and timestamp formats. This is crucial for model consistency.
    • Check timestamp frequency: Confirm that the timestamps in your new data have the same frequency as in the training data. Inconsistencies here can lead to inaccurate anomaly detection.
  2. Missing TIME_SERIES_ID_COL:

    • Review model: If your model was trained with a TIME_SERIES_ID_COL, ensure it's correctly specified in ML.DETECT_ANOMALIES. If your model doesn't use this column, this point may not be relevant.
  3. Forecast Horizon Limits:

    • Check model information: Use ML.INFORMATION_SCHEMA.MODELS to understand your model's forecast horizon. This helps in ensuring that your new data falls within this range.
    • Ensure timestamps within range: The timestamps in your new data should fall within the model's forecast horizon for anomaly detection to produce results (see the sketch after this list).
  4. Anomaly Probability Threshold:

    • Adjust threshold: The anomaly_prob_threshold is critical in defining what constitutes an anomaly. Adjusting this threshold can help in fine-tuning the sensitivity of anomaly detection.
  5. Additional Troubleshooting Steps:

    • Examine raw model output: Inspect the raw output from ML.PREDICT to better understand the model's predictions and behavior with the new data.
    • Review model creation: Re-examine the model creation query for any potential errors or misconfigurations.
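
As a quick check for points 1 and 3, here is a minimal sketch (using the table name from your question) that surfaces the timestamp range and row count of the data being passed to ML.DETECT_ANOMALIES, so you can see whether it actually extends beyond the training window:

-- Sketch: inspect the range and volume of the input data. If this is the
-- same range the model was trained on, the horizon issue in point 3 applies.
SELECT
  MIN(TIMESTAMP) AS earliest_timestamp,
  MAX(TIMESTAMP) AS latest_timestamp,
  COUNT(*) AS row_count
FROM `energy_generation.historic_demand_2016`;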

Thank you for your help! Point 5 was the most useful tip: I had to use `ML.FORECAST` rather than `ML.PREDICT`, but once I did that I realised that anomaly detection against an explicitly specified dataset only works with future data. The problem with passing in the same dataset the model was trained on is that it isn't 'future data'.
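
For reference, the shape of the ML.FORECAST call I ended up using (the horizon and confidence_level values here are illustrative, not prescribed):

SELECT *
FROM ML.FORECAST(
  MODEL `energy_generation.data_arima_model_ND`,
  STRUCT(30 AS horizon, 0.9 AS confidence_level)  -- illustrative values
);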

I find this really interesting! When you don't specify a separate dataset, it runs anomaly detection on the same dataset the model was trained on (not future data), but if you do specify a dataset, it has to contain future data. Was this a design choice?

Your observation about the behavior of BigQuery ML's ML.DETECT_ANOMALIES function is quite insightful. Let's delve into the rationale behind this design:

  1. Anomaly Detection on Training Data: When ML.DETECT_ANOMALIES is used without specifying a separate dataset, it performs anomaly detection on the training data. This is useful for understanding the model's behavior on known data and identifying any anomalies that occurred in the past. It's a way to validate the model's ability to detect anomalies based on the data it was trained on. This mode is sketched in the first query after this list.

  2. Requirement for Future Data in Separate Dataset: When specifying a separate dataset in ML.DETECT_ANOMALIES, the expectation is that this dataset represents new, unseen data — essentially, future data relative to the training dataset. This aligns with the typical use case of anomaly detection models, where the goal is to identify anomalies in incoming, real-time, or future data based on patterns learned from historical data. The second query after this list sketches this mode.

  3. Design Choice for Practical Application: This design choice reflects the practical application of anomaly detection models. In real-world scenarios, models are often trained on historical data and then applied to new data to detect anomalies. By requiring future data for a separate dataset, BigQuery ML ensures that the model is used in a way that is most relevant and valuable for predictive analytics.

  4. Encouraging Proper Model Evaluation: This approach also encourages users to properly evaluate their models. By analyzing the model's performance on historical data (training dataset) and then applying it to future data, users can better assess the model's predictive power and accuracy.
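
To make the two modes concrete, here are two minimal sketches reusing the model from this thread. The first runs historical anomaly detection with no input table; in the second, `energy_generation.historic_demand_2017` is a hypothetical table name standing in for data from after the training window:

-- Mode 1: no input data; anomalies are flagged in the training data itself.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `energy_generation.data_arima_model_ND`,
  STRUCT(0.9 AS anomaly_prob_threshold)
);

-- Mode 2: an explicit input table; its timestamps must fall after the
-- training window (within the model's horizon) to get non-NULL results.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `energy_generation.data_arima_model_ND`,
  STRUCT(0.9 AS anomaly_prob_threshold),
  (
    -- hypothetical table with timestamps after the training window
    SELECT TIMESTAMP, ND FROM `energy_generation.historic_demand_2017`
  )
);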

The behavior of ML.DETECT_ANOMALIES in BigQuery ML is aligned with the typical workflow of training on historical data and predicting on future data. This design choice ensures that the models are used in a context that maximizes their predictive capabilities and aligns with real-world applications of anomaly detection.