Hi,
I'm working with Dataform and seeking advice on handling a specific case. In my current setup, I'm using Dataform's incremental model but facing a challenge with handling rows that aren't matched by the source. Specifically, I need to delete these unmatched rows from the target table, a functionality that doesn't seem to be covered by the default Dataform script.
The use case involves a table of orders where draft versions are also present. My objective is to keep only the final version of each order in the table, even when some products were removed after the draft was submitted (so those rows no longer appear in the source).
I'm considering using a post_operations block to complement the initial merge statement.
If anyone has ideas on whether this is possible in Dataform, how to set it up, and whether a full scan can be avoided while doing it, I'd really appreciate your input.
Thanks for your help!
You're correct that the default Dataform incremental model doesn't inherently handle the deletion of unmatched rows. However, you can achieve this by combining a merge statement with a post_operations block. Here's an approach to set it up:
Define the Merge Statement:
Use a merge statement to handle updates and inserts. Ensure that the columns in the update set and insert clauses match your specific table schema.
merge into `target_table` t using `source_table` s
on t.order_id = s.order_id
when matched then update set
  product_1 = s.product_1,
  product_2 = s.product_2,
  ...
when not matched then insert (...) values (...);
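Worth noting: if your warehouse is BigQuery and you write the merge yourself (for example in an operations-type file, rather than relying on Dataform's generated merge), BigQuery's merge supports a when not matched by source clause that deletes unmatched target rows in the same statement. A minimal sketch, reusing the placeholder names above:
merge into `target_table` t using `source_table` s
on t.order_id = s.order_id
when matched then update set
  product_1 = s.product_1,
  product_2 = s.product_2
when not matched then insert (order_id, product_1, product_2)
  values (s.order_id, s.product_1, s.product_2)
-- BigQuery-specific clause: removes target rows with no matching source row
when not matched by source then delete;
This collapses the cleanup into the merge itself, though the target still has to be scanned to find the unmatched rows.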
Add a Post-operations Block:
Implement a post_operations block to delete rows from the target table that are not present in the source table. Since BigQuery doesn't use traditional indexes, consider clustering the target table on order_id (or your join columns) for efficiency.
post_operations {
  delete from `target_table`
  where order_id not in (
    select order_id from `source_table`
  );
}
This setup first performs the standard merge operation; the post_operations block then deletes all rows from target_table that are not present in source_table, based on order_id.
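For orientation, the whole pattern could live in a single SQLX file along the lines of the following sketch (table and column names are placeholders; ${ref()} and ${self()} let Dataform resolve the actual dataset and table names):
config {
  type: "incremental",
  uniqueKey: ["order_id"]
}

select order_id, product_1, product_2, last_updated
from ${ref("source_table")}

post_operations {
  delete from ${self()}
  where order_id not in (select order_id from ${ref("source_table")})
}
With uniqueKey set, Dataform generates the merge on order_id for you; the post_operations block runs after it.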
Consider Using Scheduled Triggers:
If the deletion operation is resource-intensive, consider using a scheduled_trigger to manage the deletion process separately. This can help in optimizing performance and managing resources more effectively.
config {
  schedule_triggers {
    trigger_name: "delete_unmatched_orders"
    schedule: "every 1 hour"
  }
}
...
operation "delete_unmatched_orders" {
sql: """
delete from `target_table`
where order_id not in (
select order_id from `source_table`
);
"""
}
This approach provides flexibility in scheduling and managing the deletion process.
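If scheduled triggers aren't available in your Dataform version, a similar separation can be achieved with Dataform's standard operations type and tags. A sketch with placeholder names (the schedule itself would come from a workflow configuration or your own scheduler):
-- definitions/delete_unmatched_orders.sqlx
config {
  type: "operations",
  tags: ["cleanup"]
}

delete from ${ref("target_table")}
where order_id not in (select order_id from ${ref("source_table")})
Running only the tagged action (for example, dataform run --tags cleanup) keeps the deletion on its own cadence, independent of the main pipeline.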
Testing and Validation:
Thoroughly test your setup to ensure it works as expected. Validate that both the merge and deletion operations perform correctly and that no unintended data loss occurs.
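If you use Dataform assertions, one way to catch mistakes automatically is an assertion that fails whenever orphaned rows survive the cleanup (a sketch with placeholder names):
-- definitions/assert_no_orphan_orders.sqlx
config { type: "assertion" }

-- an assertion fails if its query returns any rows
select t.order_id
from ${ref("target_table")} t
left join ${ref("source_table")} s
  on t.order_id = s.order_id
where s.order_id is null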
Thanks for your response, this is helpful. I'm wondering if there is any solution that avoids a full table scan, since running the deletion as a post-operation after the standard merge requires one.
Avoiding a full table scan for the deletion of unmatched rows in an incremental update scenario can be challenging, but there are strategies you can employ to make the process more efficient. The key is to limit the scope of the deletion to only those rows that are likely to be affected by the recent changes.
Here are some approaches:
1. Timestamp-Based Filtering:
Use a timestamp column (e.g., last_updated) to filter the rows that need deletion, so the delete only touches rows older than the last successful run. Example SQL:
post_operations {
  delete from `target_table`
  where last_updated < [timestamp of last successful run]
    and order_id not in (select order_id from `source_table`);
}
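If you want Dataform to inject that watermark instead of hard-coding it, one option is a compilation variable; the sketch below assumes a variable named last_run_ts is defined in your project settings and set per invocation:
post_operations {
  delete from ${self()}
  -- last_run_ts is an assumed compilation variable, set per run
  where last_updated < timestamp("${dataform.projectConfig.vars.last_run_ts}")
    and order_id not in (select order_id from ${ref("source_table")})
}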
2. Incremental Deletion with a Temporary Table:
Maintain a temporary table of the keys (e.g., order_id) processed in the current run, then delete against that smaller set. Example SQL:
-- Create a temporary table holding the IDs processed in this run
create temporary table processed_ids as
select order_id from `source_table`;

-- Delete target rows whose IDs were not processed
delete from `target_table`
where order_id not in (select order_id from processed_ids);
3. Partitioning:
Partition the target table (for example by order date) so the delete only scans partitions that can contain affected rows; see the sketch after this list.
4. Change Data Capture (CDC):
If the source system emits explicit delete events, apply those directly instead of computing the full set difference between source and target.
5. Use of Indexes:
BigQuery has no traditional indexes, but clustering the target table on order_id serves a similar purpose by reducing the data scanned by the anti-join.
6. Batch Processing:
Split the deletion into smaller batches (for example per partition or per key range) so no single statement has to scan everything at once.
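For item 3, a partition-pruned delete might look like the sketch below, assuming the target is partitioned on an order_date column (a placeholder) and unmatched rows can only appear in recent partitions:
delete from `target_table`
where order_date >= date_sub(current_date(), interval 7 day)  -- prunes the scan to recent partitions
  and order_id not in (select order_id from `source_table`);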
Remember: test any deletion logic carefully before running it against production data, since an incorrect anti-join can remove rows you intended to keep.
@ms4446 wrote: "...consider using a scheduled_trigger to manage the deletion process separately..."
What would the scheduled trigger be? Is it a native action of Dataform?
The scheduled_trigger feature you are referring to is part of Dataform's configuration capabilities. In Dataform, you use a scheduled_trigger to set up automated, recurring executions of your Dataform SQL workflows.
How to Set up a scheduled_trigger in Dataform:
Within your Dataform project's definitions directory:
1. Create a .df.json file (e.g., my_dataform_project.df.json) to house Dataform configurations.
2. Structure the schedule_triggers within the configuration file:
{
  "config": {
    "schedule_triggers": [  // Array to define multiple triggers
      {
        "trigger_name": "your_unique_trigger_name",
        "schedule": "your_cron_expression"
      }
    ]
  },
  // ... other Dataform config and operations (views, assertions, etc.)
}
trigger_name: Provide a unique name to identify your trigger.
schedule: Use a cron expression to define the execution schedule. Cron expressions offer great flexibility in setting up intervals (e.g., every hour, every day at midnight, every Monday at 8 AM, etc.).
Example:
{
  "config": {
    "schedule_triggers": [
      {
        "trigger_name": "hourly_data_pipeline",
        "schedule": "0 * * * *"  // Runs every hour
      }
    ]
  }
}
Important Notes:
A scheduled_trigger itself does not define which SQLX files/operations to execute. You need to link the scheduled trigger to specific Dataform SQLX files using the file or ref property within your Dataform definitions. (See https://cloud.google.com/dataform/docs)
Use the Dataform CLI command (dataform deploy) to deploy your project and make the scheduled trigger active.
Key Advantages of using scheduled_trigger in Dataform: