
Discrepancies in Confusion Matrix Item Counts and Metrics Between BigQuery ML and Vertex's XGBoost

**Question 1:**
Why does the total number of items in the confusion matrix increase to approximately 200,000 when training an XGBoost model with about 100,000 records using BigQuery ML, whereas it remains around 100,000 with Vertex's AutoML?

**Question 2:**
Why are the recall and precision values lower when training XGBoost with BigQuery ML compared to Vertex's notebook, despite using roughly the same hyperparameters?

Notebook metrics: {'precision': 0.8919952583956949, 'recall': 0.753797304483842, 'f1': 0.8078981867164722, 'loss': 0.014006406471894417}
BigQuery ML metrics: {'precision': 0.6134, 'recall': 0.3719, 'f1': 0.4630, 'loss': 0.0446}

Below is the query for BigQuery ML:
```
CREATE OR REPLACE MODEL `***.model_xg_boost.model_best_params`
OPTIONS(
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  BOOSTER_TYPE = 'GBTREE',
  LEARN_RATE = 0.01,
  MAX_ITERATIONS = 300,
  MAX_TREE_DEPTH = 5,
  SUBSAMPLE = 0.9,
  EARLY_STOP = FALSE,
  L2_REG = 0.1,
  DATA_SPLIT_METHOD = 'RANDOM',
  DATA_SPLIT_EVAL_FRACTION = 0.2,
  INPUT_LABEL_COLS = ['reaction']
) AS
SELECT
  reaction,
  year,
  month,
  day,
  hour,
  ...
FROM
  `***.***.***`;
```
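
For reference, a query like the following (assuming the metrics above were read off the automatically reserved 20% evaluation split) returns the BigQuery ML numbers quoted earlier:

```
-- Returns precision, recall, accuracy, f1_score, log_loss, and roc_auc,
-- computed on the evaluation split reserved by DATA_SPLIT_EVAL_FRACTION.
SELECT
  *
FROM
  ML.EVALUATE(MODEL `***.model_xg_boost.model_best_params`);
```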


Hi @suzuki0430,

Welcome to Google Cloud Community!

To answer your first question: the total number of items in the confusion matrix likely increases because BigQuery ML uses a different data sampling strategy than Vertex's AutoML, so the matrix ends up being computed over a larger set of rows. It's worth checking the BigQuery ML and Vertex AutoML documentation to see how each one handles data splitting and sampling.
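
One quick check is to query ML.CONFUSION_MATRIX yourself: with DATA_SPLIT_EVAL_FRACTION = 0.2 on ~100,000 rows, the cell counts should sum to roughly 20,000, so a total near 200,000 suggests the matrix is being computed over a different set of rows. A minimal sketch follows; the second form passes an explicit query so you control exactly which rows are counted (the column list is abbreviated as in your post):

```
-- Confusion matrix on the model's own evaluation split:
SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `***.model_xg_boost.model_best_params`);

-- Confusion matrix on an explicit set of rows:
SELECT
  *
FROM
  ML.CONFUSION_MATRIX(
    MODEL `***.model_xg_boost.model_best_params`,
    (SELECT reaction, year, month, day, hour /* , ... */ FROM `***.***.***`)
  );
```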

For the second question: the gap in recall and precision between Vertex's notebook and BigQuery ML, despite roughly the same hyperparameters, most likely comes from differences in the data each environment actually trains and evaluates on.

Here’s why this could happen:

  • The data might be preprocessed differently in Vertex's notebook and in BigQuery ML, leading to variations in the features the models actually train on.
  • The train/evaluation split might differ between the two environments, so the models are evaluated on different rows (see the sketch after this list for one way to pin the split down).
  • The data loaded in one environment might be of lower quality than in the other (for example, missing rows or a stale snapshot), dragging down recall and precision.
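
If you want to rule out the split entirely, one option (a sketch, under the assumption that your table has a stable unique key, called id_col here) is to materialize a boolean split column and train with DATA_SPLIT_METHOD = 'CUSTOM'; the flagged rows can then be exported to the notebook so both environments train and evaluate on identical data:

```
-- Deterministically flag ~20% of rows for evaluation (id_col is hypothetical):
CREATE OR REPLACE TABLE `***.***.data_with_split` AS
SELECT
  *,
  MOD(ABS(FARM_FINGERPRINT(CAST(id_col AS STRING))), 10) < 2 AS is_eval
FROM
  `***.***.***`;

-- With DATA_SPLIT_METHOD = 'CUSTOM', rows where the split column is TRUE are
-- used for evaluation, and the split column is excluded from the features.
CREATE OR REPLACE MODEL `***.model_xg_boost.model_custom_split`
OPTIONS(
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  DATA_SPLIT_METHOD = 'CUSTOM',
  DATA_SPLIT_COL = 'is_eval',
  INPUT_LABEL_COLS = ['reaction']
) AS
SELECT
  reaction,
  year,
  month,
  day,
  hour,
  -- ...
  is_eval
FROM
  `***.***.data_with_split`;
```

Exporting the rows where is_eval is TRUE then gives the notebook exactly the evaluation set BigQuery ML used.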

I hope the above information is helpful.