Hi,
1) I have a table that is being formed through Dataform every hour (I have an Airflow DAG Calling Dataform)
2) The Dataform file is named: Table_A.sqlx and following is the query I type in the Table_A.sqlx to get the table Table_A
config {
type: "table",
description: "Example table called Table_A",
}
SELECT
a,
b
FROM
(
SELECT * FROM ${ref("XXX_YYY_ZZZ")} AS somename1
UNION ALL
SELECT * FROM ${ref("AAA_BBB_CCC")} AS somename2
) AS combined_tables
I want to do the following
3) I want to have an incremental table called: Table_B
4) So I will file called Table_B.sqlx
5) Table_A has a column called timestamp
6) From Table_A I want to write all the rows into Table_B. The Table_B should be appended with all rows of Table_A
7) Hence I say that Table_B is incremental table
😎 The condition for merging (or, putting) records of Table_B into Table_A is that the timestamp record of Table_B must be greater than the MAX(timestamp) record of Table A
What should be the content of file Table_B.sqlx?
IS this correct?
The contents of Table_B.sqlx will be as follows:
config {
type: "incremental",
description: "Example table called Table_B",
}
SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT MAX(timestamp) FROM ${self()})`) }
IS this right?
Solved! Go to Solution.
The COALESCE function is designed to handle situations where you might have NULL values (which represent missing or unknown data). It takes a list of expressions and returns the first one that isn't NULL.
Below uses COALESCE cleverly to make sure all the data from Table_A
gets copied into Table_B
the first time you run the process:
SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01') FROM ${self()})
MAX(timestamp)
: This part tries to find the most recent timestamp in Table_B
. Since Table_B
might be empty initially, this could return NULL.
COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01')
: COALESCE comes to the rescue here:
MAX(timestamp)
returns an actual timestamp, COALESCE simply returns that timestamp. This ensures that only new data (with more recent timestamps) gets pulled from Table_A
.MAX(timestamp)
returns NULL, COALESCE provides the fallback timestamp '1970-01-01'. This makes sure all the records from Table_A
are included in the initial load, as their timestamps will naturally be more recent than the fallback.The beauty of COALESCE is its conciseness. It handles the "empty table" scenario elegantly in a single line of code.
Hi @ayushmaheshwari ,
Your approach looks correct and your understanding of incremental tables in Dataform seems spot-on. Here are some additional insights:
Your Table_B.sqlx
file is well-structured and should effectively create an incremental table:
config {
type: "incremental",
description: "Example table called Table_B",
}
SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT MAX(timestamp) FROM ${self()})
Explanation
config { type: "incremental" }
: This is the heart of the incremental table setup. It tells Dataform to only append new rows based on the timestamp condition.
SELECT * FROM ${ref("Table_A")}
: This selects all columns from your source Table_A
.
WHERE timestamp > (SELECT MAX(timestamp) FROM ${self()})
: This is the crucial filtering logic. It ensures only rows from Table_A
with a timestamp
greater than the maximum timestamp
currently in Table_B
are included. The use of self()
references Table_B
itself.
Additional Considerations
MAX(timestamp)
in Table_B
. Consider replacing the subquery with something like COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01')
to ensure the initial load pulls all data from Table_A
.timestamp
column's data type, use CAST
to explicitly convert it (e.g., CAST(timestamp AS TIMESTAMP)
).Thank you @ms4446 . I do not understand the COALESCE you describe , can you explain if you do not mind? Yes, for the first time it has to simply pull all data from Table A into Table B. How do I tell Dataform that? I do not follow by telling it through COALESCE
The COALESCE function is designed to handle situations where you might have NULL values (which represent missing or unknown data). It takes a list of expressions and returns the first one that isn't NULL.
Below uses COALESCE cleverly to make sure all the data from Table_A
gets copied into Table_B
the first time you run the process:
SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01') FROM ${self()})
MAX(timestamp)
: This part tries to find the most recent timestamp in Table_B
. Since Table_B
might be empty initially, this could return NULL.
COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01')
: COALESCE comes to the rescue here:
MAX(timestamp)
returns an actual timestamp, COALESCE simply returns that timestamp. This ensures that only new data (with more recent timestamps) gets pulled from Table_A
.MAX(timestamp)
returns NULL, COALESCE provides the fallback timestamp '1970-01-01'. This makes sure all the records from Table_A
are included in the initial load, as their timestamps will naturally be more recent than the fallback.The beauty of COALESCE is its conciseness. It handles the "empty table" scenario elegantly in a single line of code.
One caveat if the table gets partitioned by timestamp, it won't prune as-is.
One should declare a variable with the predicate (in `pre-operations`) and use that var value in the main query.