
post_operations {} or config { type: "operations"} in GCP Dataform

Hi there

I have a question about choosing between using a post_operations {} block to execute a SQL statement after an incremental table creation, or creating a separate config { type: "operations" } action, in GCP Dataform.

The goal is to update a column in the incremental data right after the new incremental records merge into the incremental table. Both the incremental table's SQL query and the update query reference the same source table, so I thought it might be better to use the post_operations {} block.

I'm not sure if the update operation will wait until the incremental data insert completes.

My question is: will post_operations {} wait until the incremental merge has completed? Thanks

Solved


6 REPLIES

The post_operations block executes its SQL only after the incremental merge has fully completed. It is part of the same action, and is triggered once that action's table creation/merge step has finished.

In contrast, a config { type: "operations" } file is a separate action and does not automatically wait for the merge: it only waits if you explicitly declare the incremental table as a dependency. Without that dependency, the update could run before the merge has finished, which might result in inconsistencies in your data.

So, if your requirement is to execute the update strictly after the merge is finished, it is advisable to use the post_operations block.

Here's an illustrative example of how you can utilize the post_operations block to update a column in the incremental data, right after the new incremental records have been merged into the incremental table:

post_operations {
  -- Runs only after the incremental merge into incremental_table has completed
  UPDATE incremental_table
  SET updated_at = CURRENT_TIMESTAMP
  WHERE id IN (
    SELECT id
    FROM incremental_source
    WHERE incremental_source.created_at > incremental_table.created_at
  )
}

With this block placed at the end of the incremental table's .sqlx definition, the new incremental records are first merged into incremental_table, and only then is the UPDATE statement executed to refresh the updated_at column. Because the UPDATE runs only after the merge is finalized, data consistency is preserved.
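For comparison, if you did want to keep the update as a separate config { type: "operations" } action, you would need to declare the incremental table as an explicit dependency so it only runs after the merge. A minimal sketch, reusing the illustrative table names from above and assuming the incremental action is named incremental_table:

config {
  type: "operations",
  dependencies: ["incremental_table"]
}

-- Runs only after the incremental_table action (and therefore its merge) has completed
UPDATE incremental_table
SET updated_at = CURRENT_TIMESTAMP
WHERE id IN (
  SELECT id
  FROM incremental_source
  WHERE incremental_source.created_at > incremental_table.created_at
)

Keeping the update in post_operations avoids having to maintain that extra dependency, which is why it is the simpler choice here.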

Thank you so much for the info, ms4446. This is exactly what I'm looking for.  

Hi,

Does this actually imply that we basically cannot chain standard SQL scripts (with { type: "operations" }, typically starting with an insert or create and then various update scripts), since using dependencies does not guarantee that executions are complete before each dependent runs? That sounds rather unexpected/undesirable.

Etienne

Hi there,

I have a question that is an extension of this topic; hope you'll be able to answer. Which one has faster compute and does not occupy storage? They can both be executed manually. Thank you in advance.

Is it using definitions/backfill_batch.js (where I can't use resolve("master")):

// definitions/backfill_batch.js
// operate("an-operation", ["SELECT 1", "SELECT 2"])

var sourceTable = "`analytics-pipeline.source.master`";
var dailyArticlePerformance = "`analytics-pipeline.reporting.daily_article_performance`";

// pipeline_queries.js contains all pipeline queries
var insertselectStatement = pipeline_queries.insertQuery + dailyArticlePerformance
    + pipeline_queries.select_daily_article_performance;
var whereCondition = pipeline_queries.wherePart1_daily_article_performance
    + pipeline_queries.wherePart2_daily_article_performance_batchBackfilling;
var groupByStatement = pipeline_queries.groupBy_daily_article_performance;

operate("backfill", script_builder.render_script(sourceTable, insertselectStatement, whereCondition, groupByStatement));

or using definitions/operation_backfilling_batch.sqlx:

config {
  type: "operations",
  schema: "reporting",
  hasOutput: true
}

${script_builder.render_script(
    resolve("master"),
    pipeline_queries.insertQuery + resolve("daily_article_performance")
        + pipeline_queries.select_daily_article_performance,
    pipeline_queries.wherePart1_daily_article_performance
        + pipeline_queries.wherePart2_daily_article_performance_batchBackfilling,
    pipeline_queries.groupBy_daily_article_performance)}

includes/script_builder.js:

// render_script returns a query string
function render_script(fromTable, selectStatement, whereConditions, groupByStatement) {
  return `${selectStatement}
      FROM ${fromTable}
      ${whereConditions}
      ${groupByStatement}
    `;
}

module.exports = {
  render_script,
};

 

and includes/pipeline_queries.js:

const insertQuery = 'insert into ';
...
module.exports = {
  insertQuery,
  select_daily_article_performance,
  wherePart1_daily_article_performance,
  wherePart2_daily_article_performance,
  wherePart2_daily_article_performance_batchBackfilling,
  groupBy_daily_article_performance,
};

In terms of compute speed and storage usage, both methods should be similar as they are essentially executing the same SQL operations. The difference lies in how the SQL scripts are organized and executed.

The first method (definitions/backfill_batch.js) uses JavaScript to dynamically generate and execute the SQL scripts. This method provides more flexibility as you can use JavaScript to manipulate the SQL scripts, such as adding conditional logic or looping through an array of tables. However, it may be harder to debug if something goes wrong, as the SQL scripts are generated at runtime.
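For instance, here is a minimal sketch of that looping pattern, reusing the render_script helper and pipeline_queries includes from your snippets (the file name and the list of target tables are purely illustrative):

// definitions/backfill_all.js (illustrative file name)
var sourceTable = "`analytics-pipeline.source.master`";
var targets = [
  "`analytics-pipeline.reporting.daily_article_performance`",
  // ...further reporting tables could be listed here
];

targets.forEach(function (target, i) {
  // Declare one backfill operation per target table
  operate("backfill_" + i, script_builder.render_script(
    sourceTable,
    pipeline_queries.insertQuery + target + pipeline_queries.select_daily_article_performance,
    pipeline_queries.wherePart1_daily_article_performance
      + pipeline_queries.wherePart2_daily_article_performance_batchBackfilling,
    pipeline_queries.groupBy_daily_article_performance
  ));
});

That kind of dynamic generation is something the .sqlx form cannot express as directly.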

The second method (definitions/operation_backfilling_batch.sqlx) uses SQLX, a SQL extension developed by Dataform, to organize and execute the SQL scripts. SQLX provides a clean and straightforward way to manage SQL scripts, and it supports features like incremental tables, assertions, and operations. It's easier to debug as the SQL scripts are static and can be reviewed before execution.

In terms of storage, neither method should occupy significant storage as they are just scripts that execute SQL operations. The storage usage would be more related to the result of the SQL operations, such as the size of the tables created or updated.

In terms of manual execution, both methods can be manually executed by calling the operation in Dataform.
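For example, one lightweight way to make either action easy to run on its own is to give it a tag in its config (the tag name below is just illustrative), so an execution can be limited to the actions carrying that tag:

config {
  type: "operations",
  schema: "reporting",
  hasOutput: true,
  tags: ["backfill"]
}

Executions started from the Dataform UI, API, or CLI can then be filtered to just that tag rather than running the whole project.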

In conclusion, both methods should have similar performance in terms of compute speed and storage usage. The choice between the two methods would depend more on your preference for organizing and managing SQL scripts.

Thank you MS4446 again.