I was wondering if there is a way to run a file incrementally in Dataform and use UUID to generate primary key to not update it every time it is run?
Every time I run it the "question_text_id_sk" is updated when it should be kept with the same field
config {
schema: "survey_warehouse",
type: "incremental",
uniqueKey: ["question_text"],
tags: ["dimension", "daily"],
columns: {
question_text_id_sk: "Primary Key", // Column descriptions are pushed to BigQuery.
}
}
SELECT
GENERATE_UUID() AS question_text_id_sk,
IFNULL(msfp_questiontext, "") AS question_text,
CURRENT_DATETIME() AS ingestion_time
FROM (
SELECT
DISTINCT msfp_questiontext,
CURRENT_DATETIME() AS ingestion_time
FROM
`msd_landing_bq.vw_msfp_questions` )
${when(incremental(), `WHERE ingestion_time > (SELECT MAX(ingestion_time) FROM ${self()})`) }
Solved! Go to Solution.
Your current configuration generates a new UUID for question_text_id_sk each time the incremental table runs. This behavior occurs because GENERATE_UUID() is executed within the SELECT statement, leading to a new UUID on every invocation.
To preserve the question_text_id_sk for existing rows, we need to generate UUIDs only for new rows. This can be achieved by using a LEFT JOIN to check if a question_text already exists in the target table. If it does, we retain the existing question_text_id_sk; if it doesn’t, we generate a new UUID.
config {
schema: "survey_warehouse",
type: "incremental",
uniqueKey: ["question_text"],
tags: ["dimension", "daily"],
columns: {
question_text_id_sk: "Primary Key",
}
}
SELECT
COALESCE(existing_data.question_text_id_sk, GENERATE_UUID()) AS question_text_id_sk,
IFNULL(base_data.msfp_questiontext, "") AS question_text,
base_data.ingestion_time
FROM (
SELECT
DISTINCT msfp_questiontext,
CURRENT_DATETIME() AS ingestion_time
FROM
`msd_landing_bq.vw_msfp_questions`
${when(incremental(), `WHERE ingestion_time > (SELECT MAX(ingestion_time) FROM ${self()})`) }
) AS base_data
LEFT JOIN
${self()} AS existing_data
ON
base_data.msfp_questiontext = existing_data.question_text;
Your current configuration generates a new UUID for question_text_id_sk each time the incremental table runs. This behavior occurs because GENERATE_UUID() is executed within the SELECT statement, leading to a new UUID on every invocation.
To preserve the question_text_id_sk for existing rows, we need to generate UUIDs only for new rows. This can be achieved by using a LEFT JOIN to check if a question_text already exists in the target table. If it does, we retain the existing question_text_id_sk; if it doesn’t, we generate a new UUID.
config {
schema: "survey_warehouse",
type: "incremental",
uniqueKey: ["question_text"],
tags: ["dimension", "daily"],
columns: {
question_text_id_sk: "Primary Key",
}
}
SELECT
COALESCE(existing_data.question_text_id_sk, GENERATE_UUID()) AS question_text_id_sk,
IFNULL(base_data.msfp_questiontext, "") AS question_text,
base_data.ingestion_time
FROM (
SELECT
DISTINCT msfp_questiontext,
CURRENT_DATETIME() AS ingestion_time
FROM
`msd_landing_bq.vw_msfp_questions`
${when(incremental(), `WHERE ingestion_time > (SELECT MAX(ingestion_time) FROM ${self()})`) }
) AS base_data
LEFT JOIN
${self()} AS existing_data
ON
base_data.msfp_questiontext = existing_data.question_text;