
Discrepancies in Confusion Matrix Item Counts and Metrics Between BigQuery ML and Vertex AI XGBoost

**Question 1:**
Why does the total number of items in the confusion matrix come to approximately 200,000 when an XGBoost model is trained on about 100,000 records with BigQuery ML, while Vertex AI AutoML reports around 100,000?
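
For context, the total in question can be reproduced by summing the cells that `ML.CONFUSION_MATRIX` returns. Below is a minimal sketch using the google-cloud-bigquery Python client; the `***` project placeholder is kept as elsewhere in this post and must be filled in with the real project:

```python
# Minimal sketch: sum the cells of the confusion matrix that BigQuery ML
# returns. Assumes the google-cloud-bigquery client library is installed
# and authenticated; `***` is a placeholder kept from this post.
from google.cloud import bigquery

client = bigquery.Client()

cm = client.query("""
    SELECT *
    FROM ML.CONFUSION_MATRIX(MODEL `***.model_xg_boost.model_best_params`)
""").to_dataframe()

# ML.CONFUSION_MATRIX returns one row per expected label and one count
# column per predicted label; the grand total of the count columns is the
# number of items the matrix covers.
count_columns = cm.columns.drop("expected_label")
total_items = cm[count_columns].sum().sum()
print("Total items in confusion matrix:", total_items)
```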

**Question 2:**
Why are the precision and recall values lower with BigQuery ML than when training XGBoost in a Vertex AI notebook, despite using roughly the same hyperparameters?

Notebook metrics: `{'precision': 0.8919952583956949, 'recall': 0.753797304483842, 'f1': 0.8078981867164722, 'loss': 0.014006406471894417}`
BigQuery ML metrics: `{'precision': 0.6134, 'recall': 0.3719, 'f1': 0.4630, 'loss': 0.0446}`
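
For context, the notebook run follows roughly the shape sketched below. Assumptions: the xgboost and scikit-learn libraries, a binary 0/1 `reaction` label, an 80/20 random split mirroring `DATA_SPLIT_EVAL_FRACTION = 0.2`, and the same `***`/`...` placeholders as in the query further down; this is a sketch of the setup, not the exact notebook code:

```python
# Hedged sketch of the notebook-side run, mapping the BigQuery ML options
# to their usual XGBoost equivalents. Placeholders (`***`, the elided
# feature list) are kept from the original post.
import xgboost as xgb
from google.cloud import bigquery
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

client = bigquery.Client()
df = client.query("""
    SELECT reaction, year, month, day, hour  -- plus the remaining features
    FROM `***.***.***`
""").to_dataframe()

X = df.drop(columns=["reaction"])
y = df["reaction"]  # assumes a binary 0/1 label

# 80/20 random split, mirroring DATA_SPLIT_EVAL_FRACTION = 0.2.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = xgb.XGBClassifier(
    booster="gbtree",    # BOOSTER_TYPE = 'GBTREE'
    learning_rate=0.01,  # LEARN_RATE = 0.01
    n_estimators=300,    # MAX_ITERATIONS = 300
    max_depth=5,         # MAX_TREE_DEPTH = 5
    subsample=0.9,       # SUBSAMPLE = 0.9
    reg_lambda=0.1,      # L2_REG = 0.1
)
model.fit(X_train, y_train)

pred = model.predict(X_eval)
print({
    "precision": precision_score(y_eval, pred),
    "recall": recall_score(y_eval, pred),
    "f1": f1_score(y_eval, pred),
})
```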

Below is the query for BigQuery ML:
```sql
CREATE OR REPLACE MODEL `***.model_xg_boost.model_best_params`
OPTIONS(
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  BOOSTER_TYPE = 'GBTREE',
  LEARN_RATE = 0.01,
  MAX_ITERATIONS = 300,
  MAX_TREE_DEPTH = 5,
  SUBSAMPLE = 0.9,
  EARLY_STOP = FALSE,
  L2_REG = 0.1,
  DATA_SPLIT_METHOD = 'RANDOM',
  DATA_SPLIT_EVAL_FRACTION = 0.2,
  INPUT_LABEL_COLS = ['reaction']
) AS
SELECT
  reaction,
  year,
  month,
  day,
  hour,
  ...
FROM
  `***.***.***`;
```
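
To put both sets of numbers on the same footing, the BigQuery ML side can be pulled with `ML.EVALUATE`; a short sketch reusing the same client setup as the earlier snippets:

```python
# Sketch: fetch the BigQuery ML evaluation metrics for the model trained
# by the query above, for a line-by-line comparison with the notebook
# output. `***` is a placeholder kept from this post.
from google.cloud import bigquery

client = bigquery.Client()

metrics = client.query("""
    SELECT precision, recall, f1_score, log_loss
    FROM ML.EVALUATE(MODEL `***.model_xg_boost.model_best_params`)
""").to_dataframe()
print(metrics)
```

Differences in the default classification threshold and in BigQuery ML's automatic feature preprocessing are worth ruling out before comparing these numbers directly with the notebook's.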
