Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Dataform Incremental table - Am I thinking right?

Hi,

1) I have a table that is being formed through Dataform every hour (I have an Airflow DAG Calling Dataform)
2) The Dataform file is named: Table_A.sqlx and following is the query I type in the Table_A.sqlx to get the table Table_A

config {
    type: "table",
    description: "Example table called Table_A",
    
}

SELECT 
    a,
    b

FROM 
    (
    SELECT * FROM ${ref("XXX_YYY_ZZZ")} AS somename1
    UNION ALL
    SELECT * FROM ${ref("AAA_BBB_CCC")} AS somename2
    
    ) AS combined_tables

I want to do the following

3) I want to have an incremental table called: Table_B

4) So I will file called Table_B.sqlx

5) Table_A has a column called timestamp

6) From Table_A I want to write all the rows into Table_B. The Table_B should be appended with all rows of Table_A

7) Hence I say that Table_B is incremental table

😎 The condition for merging (or, putting) records of Table_B into Table_A is that the timestamp record of Table_B must be greater than the MAX(timestamp) record of Table A

What should be the content of file Table_B.sqlx?

IS this correct?

The contents of Table_B.sqlx will be as follows:

config {
    type: "incremental",
    description: "Example table called Table_B",
    
}

SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT MAX(timestamp) FROM ${self()})`) }

IS this right?

@ms4446 

Solved Solved
0 4 555
1 ACCEPTED SOLUTION

The COALESCE function is designed to handle situations where you might have NULL values (which represent missing or unknown data). It takes a list of expressions and returns the first one that isn't NULL.

Below uses COALESCE cleverly to make sure all the data from Table_A gets copied into Table_B the first time you run the process:

 
SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01') FROM ${self()})
  1. MAX(timestamp): This part tries to find the most recent timestamp in Table_B. Since Table_B might be empty initially, this could return NULL.

  2. COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01'): COALESCE comes to the rescue here:

    • Scenario 1: Table_B has data: If MAX(timestamp) returns an actual timestamp, COALESCE simply returns that timestamp. This ensures that only new data (with more recent timestamps) gets pulled from Table_A.
    • Scenario 2: Table_B is empty: If MAX(timestamp) returns NULL, COALESCE provides the fallback timestamp '1970-01-01'. This makes sure all the records from Table_A are included in the initial load, as their timestamps will naturally be more recent than the fallback.

The beauty of COALESCE is its conciseness. It handles the "empty table" scenario elegantly in a single line of code.

View solution in original post

4 REPLIES 4