Hello Dataform Community,
I'm currently working with Dataform and trying to better understand some of the specifics around its repository size limitations and how it manages compilation resources. I have a couple of questions and would greatly appreciate any insights or experiences you might share.
Dataform enforces certain usage limits during the compilation of a repository, such as CPU time usage, maximum heap memory, and notably, a maximum number of actions per repository compilation. The limits I'm particularly curious about are:
Maximum Number of Actions Per Repository Compilation: The limit is set at 5000. Here, I understand that actions refer to defined units of work (like SQL queries) within the data pipeline.
Maximum Repository Size Indicated by SQL Workflow Nodes: There's a mention that large repository sizes, specifically those with more than 1000 SQL workflow nodes, might lead to issues.
My confusion arises from trying to understand the distinction between these two limits. Aren't individual workflow nodes equivalent to actions in this context? If so, why is there a discrepancy between the allowed number of actions (5000) and the suggested maximum number of SQL workflow nodes (1000) for a stable repository?
Additionally, if anyone has tips or best practices on managing and optimizing repositories within these limits, or experiences with the compilation resource usage estimation, I would be very interested to hear about them. I saw the best practices doc on splitting up the repositories. However, I want to know if there is anyway I can calculate/estimate the feasibility of having mono-repo for our existing workflows we have with other tool.
Thank you in advance for your time and assistance!
Solved! Go to Solution.
Hi @ylee1212
Your understanding of the distinction between actions and SQL workflow nodes in a Dataform SQLX file is generally accurate. Let me further clarify:
Assertions: Each assertion is indeed an action, serving as a check or validation on the data. However, they are not typically counted as separate SQL workflow nodes. Rather, they are considered part of the data quality assessment process. Therefore, your SQLX file would have 2 actions but 0 SQL workflow nodes.
Pre/Post Operations: Similar to assertions, pre/post operations are actions that contribute to the setup or cleanup of your main SQL query. They don't directly represent individual steps in the data transformation pipeline. Hence, they also don't count as separate SQL workflow nodes. Your SQLX file would still have 2 actions and 0 SQL workflow nodes.
Actual SQL Query to Generate Table: This is the core element of your SQLX file, encompassing both an action and a SQL workflow node. The SQL query itself represents a distinct step in the data transformation process. Therefore, it counts as 1 action and 1 SQL workflow node.
External Source Declaration: Assuming you're referring to defining a reference to an external data source, such as a link to another database or table, this doesn't typically count as a SQL workflow node. It's more akin to a configuration or reference setup. Your SQLX file would maintain 5 actions (2 assertions, 2 pre/post operations, and 1 SQL query) and 1 SQL workflow node (the SQL query itself).
Remember, the concept of actions and nodes serves to manage and optimize the compilation and execution process in Dataform. Actions represent individual tasks or operations, while SQL workflow nodes reflect the key steps in your data transformation pipeline.
Hi @ylee1212 ,
Dataform imposes certain limitations to ensure efficient resource utilization and prevent performance bottlenecks. Two key limits to consider are:
Maximum Number of Actions Per Repository Compilation: This technical constraint is set at 5000 actions, which includes a variety of tasks such as SQL queries, assertions, and data transformations. Exceeding this limit can lead to compilation errors and hinder data pipeline execution.
Maximum Repository Size Indicated by SQL Workflow Nodes: While not a strict constraint, a repository with more than 1000 SQL workflow nodes may experience performance degradation and increased complexity. This guideline serves as a recommendation to maintain optimal pipeline efficiency and manageability, particularly since SQL workflow nodes often involve more complex and resource-intensive operations.
To effectively manage and optimize Dataform repositories, consider these strategies:
Migrating to a mono-repo approach for Dataform repositories can offer benefits, but careful consideration is required. Evaluate the following factors:
Regularly monitor the performance of your data pipelines to identify and address performance bottlenecks or optimization opportunities. This proactive approach ensures that your data pipelines operate efficiently and reliably.
For repositories approaching the upper limits, consider modularizing the pipeline, using environment-specific configurations, and implementing efficient error handling. These practices can significantly enhance the manageability and performance of large repositories.
Thanks so much for your response @ms4446 !
To confirm if I understood the concept clearly, let's say if I have a SQLX file that is used to create one table/view. The table is equivalent to one SQL workflow node. Within the SQLX file, I have these operations:
1. 2 assertion - so 2 actions and 2 node?
2. pre/post operations - so 2 action but 0 node.
3. Actual SQL query to generate table - 1 action and 1 node.
Thus, within my single SQLX file, I have 5 actions and 3 SQL nodes. Is my understanding correct? Does External source declaration also count as a SQL node?
Hi @ylee1212
Your understanding of the distinction between actions and SQL workflow nodes in a Dataform SQLX file is generally accurate. Let me further clarify:
Assertions: Each assertion is indeed an action, serving as a check or validation on the data. However, they are not typically counted as separate SQL workflow nodes. Rather, they are considered part of the data quality assessment process. Therefore, your SQLX file would have 2 actions but 0 SQL workflow nodes.
Pre/Post Operations: Similar to assertions, pre/post operations are actions that contribute to the setup or cleanup of your main SQL query. They don't directly represent individual steps in the data transformation pipeline. Hence, they also don't count as separate SQL workflow nodes. Your SQLX file would still have 2 actions and 0 SQL workflow nodes.
Actual SQL Query to Generate Table: This is the core element of your SQLX file, encompassing both an action and a SQL workflow node. The SQL query itself represents a distinct step in the data transformation process. Therefore, it counts as 1 action and 1 SQL workflow node.
External Source Declaration: Assuming you're referring to defining a reference to an external data source, such as a link to another database or table, this doesn't typically count as a SQL workflow node. It's more akin to a configuration or reference setup. Your SQLX file would maintain 5 actions (2 assertions, 2 pre/post operations, and 1 SQL query) and 1 SQL workflow node (the SQL query itself).
Remember, the concept of actions and nodes serves to manage and optimize the compilation and execution process in Dataform. Actions represent individual tasks or operations, while SQL workflow nodes reflect the key steps in your data transformation pipeline.
Got it!! Thanks for the clarification! My understanding of node was equivalent to each nodes that show up in the compiled graph. That was the reason why I was counting assertions as a node since I can set dependencies on the nodes in the compile graph to other SQLX files.
Hello , Thank you for these clarifications, just I have a question about declaration in dataform , it's considered as action or a workflow node ?
In Dataform, a declaration is typically considered as an action but not as a SQL workflow node. Here's how it breaks down:
Declaration as an Action: A declaration in Dataform is used to define datasets that exist in your warehouse but are not created by your Dataform project. Since it involves specifying metadata about these datasets (like their schema, name, etc.), it is considered an action. This action is more about informing Dataform of the dataset's existence and structure rather than performing a data transformation or computation.
Not a SQL Workflow Node: SQL workflow nodes generally represent the key steps in your data transformation pipeline, such as creating tables, views, or executing complex SQL queries. Declarations don't fit into this category because they don't actively transform or manipulate data; they simply declare the presence of external datasets.
So, when you declare a dataset in Dataform, it counts towards your total number of actions but does not add to your count of SQL workflow nodes. This distinction is important for understanding how your project will be compiled and executed within Dataform's limits and constraints.
Thanks for your reply , we have more than 500 tables (declaration) as input for our dataform pipeline, how can I optimise these declarations to avoid compilation issues
we don't want to split our repository..
Thank you
Managing a large number of table declarations (over 500 in your case) in Dataform without splitting the repository can be challenging, but there are strategies you can employ to optimize these declarations and avoid compilation issues:
Group Declarations: If possible, group similar table declarations together. This can be done by creating a generic declaration template and then passing different parameters (like table names, schemas, etc.) to it. This approach can reduce the overall number of individual declaration actions.
Use Wildcard Declarations: If your data warehouse supports it, and if your tables follow a consistent naming pattern, you might be able to use wildcard declarations. This allows you to declare multiple tables with a single action by specifying a pattern rather than individual table names.
Optimize SQLX Files: If you have multiple declarations in separate SQLX files, consider consolidating them into fewer files. While this doesn't reduce the total number of declarations, it can sometimes help with organization and reduce the overhead of managing many separate files.
Leverage Incremental Models: For tables that are frequently updated, consider using incremental models. This approach allows Dataform to process only the new or changed data, which can significantly reduce the compilation and execution load.
Review and Clean Up Unused Declarations: Over time, some tables might become obsolete or unused. Regularly review your declarations and remove any that are no longer needed. This not only reduces the number of actions but also declutters your project.
Use Environment-Specific Configurations: If some tables are only relevant in certain environments (like development, staging, or production), use environment-specific configurations to declare them only in the relevant environments. This can reduce the load during compilation in each environment.
Monitor and Optimize Performance: Regularly monitor the performance of your Dataform project. Look for bottlenecks or inefficiencies in the compilation process and address them promptly.
Hello,
Could you answer some questions for me, please?
How many workspaces can we have per repository?
If I have an application with multiple services, can I have one app repository and one workspace per service?
I would also like to know the best practices for crossing data between different workspaces and also between different repositories.
Thank you so much!