Announcements
This site is in read only until July 22 as we migrate to a new platform; refer to this community post for more details.
Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Reference to Date Sharded Tables Created Using While Loop in dataform

xj
Bronze 1
Bronze 1

I am working on setting up a pipeline in Dataform to create date-sharded tables that can be referenced as dependencies in upcoming steps in Dataform. My current approach involves using a WHILE loop to create a table for each month up until the current month. Here is the script I am using:

 

config { type: "operations", 
schema: "datasetID",
 tags: ["sales"], 
hasOutput:true } 

DECLARE active_month DATE DEFAULT DATE("2024-01-01"); 
DECLARE formatted_shard_date STRING; 
DECLARE sql STRING; -- Get the fully qualified table name as a string 
DECLARE table_reference STRING; SET table_reference = "${ref("dataset", "table")}"; 

WHILE active_month < DATE_TRUNC(CURRENT_DATE(), MONTH) DO 
BEGIN 
SET active_month = active_month; 
SET formatted_shard_date = FORMAT_DATE('%Y%m%d', active_month); 
SET sql = FORMAT(""" CREATE OR REPLACE TABLE `project_id.dataset.table_%s` PARTITION BY RANGE_BUCKET(state_id, GENERATE_ARRAY(0, 3999,1)) CLUSTER BY district_state AS (......))""", formatted_shard_date) 

EXECUTE IMMEDIATE sql; 

SET active_month = DATE_ADD(active_month, INTERVAL 1 MONTH); END; END WHILE;

 

In this case i do not know how to refer to the created tables in dataform and use this as a dependency. 

What I need help with are the following:

  • Conditional Execution: I want the script to execute fully on a full refresh (creating tables for each month), but only create the table for the latest month if not run with a full refresh. How can I modify the script to accommodate this condition?

  • Dependency Management: How can I reference the tables created in this script as dependencies in subsequent .sqlx files in Dataform? This is especially so as I'm referring to all date-sharded tables created by using WHILE loop

  • Incremental Updates: I am unsure how to define this script as incremental or not. What are the best practices for managing incremental table updates in this scenario?

Any advice on how to structure this script better for my requirements would be greatly appreciated!

 

Thank you!

0 2 296
2 REPLIES 2

Hello,

Thank you for contacting Google Cloud Community!

I would suggest you the following:

Incremental Updates
To handle incremental updates efficiently, consider these approaches:

  • Design your tables as append-only, adding new data to existing partitions. This avoids costly table rewrites and improves performance.
  • Use Dataform's partition management capabilities to automatically create new partitions as needed. This can help streamline the process of managing date-sharded tables.
  • If your source system supports CDC, leverage it to identify changes and apply them to your target tables incrementally.

Regards,
Jai Ade

xj
Bronze 1
Bronze 1

but how am i able to create date sharded tables with incremental updates? While I understand that we are able to use partition, I'm looking to create date sharded table that is partitioned on non-date column. In this way, I'm able to further segment the table created and optimize on read.