Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

I was wondering if there is a way to run a file incrementally in Dataform and use UUID to generate p

I was wondering if there is a way to run a file incrementally in Dataform and use UUID to generate primary key to not update it every time it is run?

Every time I run it the "question_text_id_sk" is updated when it should be kept with the same field

 

 

 

 

config {
    schema: "survey_warehouse",
    type: "incremental",
    uniqueKey: ["question_text"],
    tags: ["dimension", "daily"],
    columns: {
        question_text_id_sk: "Primary Key", // Column descriptions are pushed to BigQuery.
    }
}

SELECT
  GENERATE_UUID() AS question_text_id_sk,
  IFNULL(msfp_questiontext, "") AS question_text,
  CURRENT_DATETIME() AS ingestion_time
FROM (
  SELECT
    DISTINCT msfp_questiontext,
    CURRENT_DATETIME() AS ingestion_time
  FROM
    `msd_landing_bq.vw_msfp_questions` )
${when(incremental(), `WHERE ingestion_time > (SELECT MAX(ingestion_time) FROM ${self()})`) }

 

 

 

 

 

 

 

 

Solved Solved
0 1 137
1 ACCEPTED SOLUTION

Your current configuration generates a new UUID for question_text_id_sk each time the incremental table runs. This behavior occurs because GENERATE_UUID() is executed within the SELECT statement, leading to a new UUID on every invocation.

To preserve the question_text_id_sk for existing rows, we need to generate UUIDs only for new rows. This can be achieved by using a LEFT JOIN to check if a question_text already exists in the target table. If it does, we retain the existing question_text_id_sk; if it doesn’t, we generate a new UUID.

 

config {
  schema: "survey_warehouse",
  type: "incremental",
  uniqueKey: ["question_text"],
  tags: ["dimension", "daily"],
  columns: {
    question_text_id_sk: "Primary Key",
  }
}

SELECT
  COALESCE(existing_data.question_text_id_sk, GENERATE_UUID()) AS question_text_id_sk, 
  IFNULL(base_data.msfp_questiontext, "") AS question_text,
  base_data.ingestion_time
FROM (
  SELECT
    DISTINCT msfp_questiontext,
    CURRENT_DATETIME() AS ingestion_time
  FROM
    `msd_landing_bq.vw_msfp_questions`
  ${when(incremental(), `WHERE ingestion_time > (SELECT MAX(ingestion_time) FROM ${self()})`) }
) AS base_data
LEFT JOIN 
  ${self()} AS existing_data 
ON 
  base_data.msfp_questiontext = existing_data.question_text;

 

  • LEFT JOIN: The LEFT JOIN connects the new data (base_data) with the existing data in the target table (existing_data).
  • COALESCE: COALESCE intelligently selects the question_text_id_sk from the existing_data if it exists. For new rows (where the join doesn’t find a match), it generates a new UUID.

 

View solution in original post

1 REPLY 1

Your current configuration generates a new UUID for question_text_id_sk each time the incremental table runs. This behavior occurs because GENERATE_UUID() is executed within the SELECT statement, leading to a new UUID on every invocation.

To preserve the question_text_id_sk for existing rows, we need to generate UUIDs only for new rows. This can be achieved by using a LEFT JOIN to check if a question_text already exists in the target table. If it does, we retain the existing question_text_id_sk; if it doesn’t, we generate a new UUID.

 

config {
  schema: "survey_warehouse",
  type: "incremental",
  uniqueKey: ["question_text"],
  tags: ["dimension", "daily"],
  columns: {
    question_text_id_sk: "Primary Key",
  }
}

SELECT
  COALESCE(existing_data.question_text_id_sk, GENERATE_UUID()) AS question_text_id_sk, 
  IFNULL(base_data.msfp_questiontext, "") AS question_text,
  base_data.ingestion_time
FROM (
  SELECT
    DISTINCT msfp_questiontext,
    CURRENT_DATETIME() AS ingestion_time
  FROM
    `msd_landing_bq.vw_msfp_questions`
  ${when(incremental(), `WHERE ingestion_time > (SELECT MAX(ingestion_time) FROM ${self()})`) }
) AS base_data
LEFT JOIN 
  ${self()} AS existing_data 
ON 
  base_data.msfp_questiontext = existing_data.question_text;

 

  • LEFT JOIN: The LEFT JOIN connects the new data (base_data) with the existing data in the target table (existing_data).
  • COALESCE: COALESCE intelligently selects the question_text_id_sk from the existing_data if it exists. For new rows (where the join doesn’t find a match), it generates a new UUID.