Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Clarification on Dataform Repository Size Limitations and Actions vs. SQL Workflow Nodes

Hello Dataform Community,

I'm currently working with Dataform and trying to better understand some of the specifics around its repository size limitations and how it manages compilation resources. I have a couple of questions and would greatly appreciate any insights or experiences you might share.

Dataform enforces certain usage limits during the compilation of a repository, such as CPU time usage, maximum heap memory, and notably, a maximum number of actions per repository compilation. The limits I'm particularly curious about are:

  1. Maximum Number of Actions Per Repository Compilation: The limit is set at 5000. Here, I understand that actions refer to defined units of work (like SQL queries) within the data pipeline.

  2. Maximum Repository Size Indicated by SQL Workflow Nodes: There's a mention that large repository sizes, specifically those with more than 1000 SQL workflow nodes, might lead to issues.

My confusion arises from trying to understand the distinction between these two limits. Aren't individual workflow nodes equivalent to actions in this context? If so, why is there a discrepancy between the allowed number of actions (5000) and the suggested maximum number of SQL workflow nodes (1000) for a stable repository?

Additionally, if anyone has tips or best practices on managing and optimizing repositories within these limits, or experiences with the compilation resource usage estimation, I would be very interested to hear about them. I saw the best practices doc on splitting up the repositories. However, I want to know if there is anyway I can calculate/estimate the feasibility of having mono-repo for our existing workflows we have with other tool. 

Thank you in advance for your time and assistance!

Solved Solved
0 9 1,973
1 ACCEPTED SOLUTION

Hi @ylee1212 

Your understanding of the distinction between actions and SQL workflow nodes in a Dataform SQLX file is generally accurate. Let me further clarify:

  1. Assertions: Each assertion is indeed an action, serving as a check or validation on the data. However, they are not typically counted as separate SQL workflow nodes. Rather, they are considered part of the data quality assessment process. Therefore, your SQLX file would have 2 actions but 0 SQL workflow nodes.

  2. Pre/Post Operations: Similar to assertions, pre/post operations are actions that contribute to the setup or cleanup of your main SQL query. They don't directly represent individual steps in the data transformation pipeline. Hence, they also don't count as separate SQL workflow nodes. Your SQLX file would still have 2 actions and 0 SQL workflow nodes.

  3. Actual SQL Query to Generate Table: This is the core element of your SQLX file, encompassing both an action and a SQL workflow node. The SQL query itself represents a distinct step in the data transformation process. Therefore, it counts as 1 action and 1 SQL workflow node.

  4. External Source Declaration: Assuming you're referring to defining a reference to an external data source, such as a link to another database or table, this doesn't typically count as a SQL workflow node. It's more akin to a configuration or reference setup. Your SQLX file would maintain 5 actions (2 assertions, 2 pre/post operations, and 1 SQL query) and 1 SQL workflow node (the SQL query itself).

Remember, the concept of actions and nodes serves to manage and optimize the compilation and execution process in Dataform. Actions represent individual tasks or operations, while SQL workflow nodes reflect the key steps in your data transformation pipeline.

View solution in original post

9 REPLIES 9