I am working on setting up a pipeline in Dataform to create date-sharded tables that can be referenced as dependencies in upcoming steps in Dataform. My current approach involves using a WHILE loop to create a table for each month up until the current month. Here is the script I am using:
config { type: "operations",
schema: "datasetID",
tags: ["sales"],
hasOutput:true }
DECLARE active_month DATE DEFAULT DATE("2024-01-01");
DECLARE formatted_shard_date STRING;
DECLARE sql STRING; -- Get the fully qualified table name as a string
DECLARE table_reference STRING; SET table_reference = "${ref("dataset", "table")}";
WHILE active_month < DATE_TRUNC(CURRENT_DATE(), MONTH) DO
BEGIN
SET active_month = active_month;
SET formatted_shard_date = FORMAT_DATE('%Y%m%d', active_month);
SET sql = FORMAT(""" CREATE OR REPLACE TABLE `project_id.dataset.table_%s` PARTITION BY RANGE_BUCKET(state_id, GENERATE_ARRAY(0, 3999,1)) CLUSTER BY district_state AS (......))""", formatted_shard_date)
EXECUTE IMMEDIATE sql;
SET active_month = DATE_ADD(active_month, INTERVAL 1 MONTH); END; END WHILE;
In this case i do not know how to refer to the created tables in dataform and use this as a dependency.
What I need help with are the following:
Conditional Execution: I want the script to execute fully on a full refresh (creating tables for each month), but only create the table for the latest month if not run with a full refresh. How can I modify the script to accommodate this condition?
Dependency Management: How can I reference the tables created in this script as dependencies in subsequent .sqlx files in Dataform? This is especially so as I'm referring to all date-sharded tables created by using WHILE loop
Incremental Updates: I am unsure how to define this script as incremental or not. What are the best practices for managing incremental table updates in this scenario?
Any advice on how to structure this script better for my requirements would be greatly appreciated!
Thank you!
Hello,
Thank you for contacting Google Cloud Community!
I would suggest you the following:
Incremental Updates
To handle incremental updates efficiently, consider these approaches:
Regards,
Jai Ade
but how am i able to create date sharded tables with incremental updates? While I understand that we are able to use partition, I'm looking to create date sharded table that is partitioned on non-date column. In this way, I'm able to further segment the table created and optimize on read.