Hello,
I am trying to run the following queries:
CREATE OR REPLACE MODEL `energy_generation.data_arima_model_ND`
OPTIONS(
  MODEL_TYPE="ARIMA_PLUS",
  TIME_SERIES_TIMESTAMP_COL="TIMESTAMP",
  TIME_SERIES_DATA_COL="ND"
) AS
SELECT TIMESTAMP, ND FROM `energy_generation.historic_demand_2016`;

SELECT * FROM ML.DETECT_ANOMALIES(
  MODEL `energy_generation.data_arima_model_ND`,
  STRUCT(0.9 AS anomaly_prob_threshold),
  (SELECT * FROM `energy_generation.historic_demand_2016`)
);
The values I am getting back are all null: is_anomaly, lower_bound, upper_bound, and anomaly_probability are null for every row.
For context, I will eventually point ML.DETECT_ANOMALIES at a separate table, which is why I am testing with the table specified explicitly rather than just using the model on its own.
I have tried adding 'TIME_SERIES_ID_COL', but I get the same issue.
I am not sure what I am missing and would appreciate any help!
Thank you for your help! Point 5 was the most useful tip, I had to use `ML.FORECAST` rather than `ML.PREDICT` but when I did that I realised that I could only use future data when doing the anomaly detection on a different dataset! The problem with specifying the same dataset as the one the model was trained on is that this isn't 'future data'.
I find this really interesting! When you don't specify a separate dataset, it does the anomaly detection on the same dataset the model was trained on (not future data), but if you indicate a dataset, you have to use future data. Was this a design choice?
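For reference, a minimal sketch of the kind of `ML.FORECAST` call described above (the `horizon` and `confidence_level` values here are illustrative, not from this thread):

```sql
-- Generate forecasts (i.e. future rows) from the trained model.
-- 30 and 0.9 are placeholder values for illustration only.
SELECT *
FROM ML.FORECAST(
  MODEL `energy_generation.data_arima_model_ND`,
  STRUCT(30 AS horizon, 0.9 AS confidence_level)
);
```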
Here are potential causes of, and solutions for, the null values issue with ML.DETECT_ANOMALIES:
1. Data Inconsistency: Make sure the input you pass to ML.DETECT_ANOMALIES matches the training data in column names, data types, and timestamp formats. This is crucial for model consistency.
2. Missing TIME_SERIES_ID_COL: If your model was trained with a TIME_SERIES_ID_COL, ensure it is also present and correctly specified in the input to ML.DETECT_ANOMALIES. If your model doesn't use this column, this point may not be relevant.
3. Forecast Horizon Limits: Query the INFORMATION_SCHEMA.MODELS view to understand your model's forecast horizon. This helps ensure that your new data falls within this range.
4. Anomaly Probability Threshold: anomaly_prob_threshold is critical in defining what constitutes an anomaly. Adjusting this threshold can help fine-tune the sensitivity of anomaly detection.
5. Additional Troubleshooting Steps: Run ML.PREDICT to better understand the model's predictions and behavior with the new data.
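As a quick sketch of the schema check in point 1 (assuming BigQuery's dataset-level INFORMATION_SCHEMA.COLUMNS view; dataset and table names taken from the question):

```sql
-- List the input table's columns and types so they can be
-- compared against what the model was trained on.
SELECT column_name, data_type
FROM `energy_generation.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'historic_demand_2016';
```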
Your observation about the behavior of BigQuery ML's ML.DETECT_ANOMALIES function is quite insightful. Let's delve into the rationale behind this design:
Anomaly Detection on Training Data: When ML.DETECT_ANOMALIES is used without specifying a separate dataset, it performs anomaly detection on the training data. This is useful for understanding the model's behavior on known data and identifying any anomalies that occurred in the past. It's a way to validate the model's ability to detect anomalies based on the data it was trained on.
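Concretely, that form of the call omits the input-data argument entirely; a minimal sketch using the model from this thread:

```sql
-- Detect anomalies over the model's own training data:
-- no third (input table) argument is supplied.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `energy_generation.data_arima_model_ND`,
  STRUCT(0.9 AS anomaly_prob_threshold)
);
```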
Requirement for Future Data in a Separate Dataset: When specifying a separate dataset in ML.DETECT_ANOMALIES, the expectation is that this dataset represents new, unseen data — essentially, future data relative to the training dataset. This aligns with the typical use case of anomaly detection models, where the goal is to identify anomalies in incoming, real-time, or future data based on patterns learned from historical data.
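A sketch of the separate-dataset form (here `energy_generation.historic_demand_2017` is a hypothetical table whose timestamps fall after the training window):

```sql
-- Detect anomalies in new data: the rows supplied must lie
-- after the training data, within the model's forecast horizon.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `energy_generation.data_arima_model_ND`,
  STRUCT(0.9 AS anomaly_prob_threshold),
  (SELECT TIMESTAMP, ND FROM `energy_generation.historic_demand_2017`)
);
```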
Design Choice for Practical Application: This design choice reflects the practical application of anomaly detection models. In real-world scenarios, models are often trained on historical data and then applied to new data to detect anomalies. By requiring future data for a separate dataset, BigQuery ML ensures that the model is used in a way that is most relevant and valuable for predictive analytics.
Encouraging Proper Model Evaluation: This approach also encourages users to properly evaluate their models. By analyzing the model's performance on historical data (training dataset) and then applying it to future data, users can better assess the model's predictive power and accuracy.
The behavior of ML.DETECT_ANOMALIES in BigQuery ML is aligned with the typical workflow of training on historical data and predicting on future data. This design choice ensures that models are used in a context that maximizes their predictive capabilities and aligns with real-world applications of anomaly detection.