
Dataform - Incremental Update (MERGE), But Delete When Not Matched By Source

Hi,

I'm working with Dataform and seeking advice on handling a specific case. In my current setup, I'm using Dataform's incremental model but facing a challenge with handling rows that aren't matched by the source. Specifically, I need to delete these unmatched rows from the target table, a functionality that doesn't seem to be covered by the default Dataform script.

The use case involves a table of orders that also contains draft versions of each order. My objective is to keep only the final version of each order in the table, including when some of the products were removed after the draft was submitted.

I'm considering using a post_operations block to complement the initial merge statement.

If anyone has ideas on whether this is possible in Dataform, how to set it up, and whether a full table scan can be avoided while doing it, I'd really appreciate your input.

Thanks for your help!

ACCEPTED SOLUTION

You're correct that the default Dataform incremental model doesn't inherently handle the deletion of unmatched rows. However, you can achieve this by combining a merge statement with a post_operations block. Here's one way to set it up:

  1. Define the Merge Statement:

    Use a merge statement to handle updates and inserts. Ensure that the columns in the update set and insert clauses match your specific table schema.

     
    merge into `target_table` t using `source_table` s
    on t.order_id = s.order_id
    when matched then update set
      product_1 = s.product_1,
      product_2 = s.product_2,
      ...
    when not matched then insert (...) values (...);
    
  2. Add a post_operations Block:

    Implement a post_operations block to delete rows from the target table that are not present in the source table. BigQuery doesn't use traditional indexes, so clustering the table on order_id (or your join columns) helps keep this step efficient.

     
    post_operations {
      -- Remove target rows that no longer exist in the source.
      -- Assumes order_id is never null in the source; otherwise prefer NOT EXISTS.
      delete from `target_table`
      where order_id not in (
        select order_id from `source_table`
      );
    }
    

    This setup first performs the standard merge operation, then the post_operations block deletes all rows from target_table that are not present in source_table, based on order_id. A complete SQLX sketch combining these steps appears after this list.

  3. Consider Using Scheduled Triggers:

    If the deletion operation is resource-intensive, consider using a scheduled_trigger to manage the deletion process separately. This can help in optimizing performance and managing resources more effectively. 

     
    config {
      schedule_triggers {
        trigger_name: "delete_unmatched_orders"
        schedule: "every 1 hour"
      }
    }
    ...
    
    operation "delete_unmatched_orders" {
      sql: """
        delete from `target_table`
        where order_id not in (
          select order_id from `source_table`
        );
      """
    }
    

    This approach provides flexibility in scheduling and managing the deletion process.

  4. Testing and Validation:

    Thoroughly test your setup to ensure it works as expected. Validate that both the merge and deletion operations perform correctly and that no unintended data loss occurs.
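
For reference, here is a minimal sketch of how steps 1 and 2 can live in a single incremental SQLX file. It assumes a BigQuery target, the order_id / product / last_updated columns from the examples above, and that source_table is declared so ${ref(...)} resolves (otherwise use the fully qualified table name). With uniqueKey set, Dataform generates the MERGE for you, and post_operations runs the cleanup delete afterwards. Note that the delete still reads the full source table; the replies below discuss ways to narrow that scan.

config {
  type: "incremental",
  uniqueKey: ["order_id"]  // Dataform generates a MERGE keyed on order_id
}

select
  order_id,
  product_1,
  product_2,
  last_updated
from ${ref("source_table")}
-- On incremental runs, only pick up rows changed since the last run.
${when(incremental(), `where last_updated > (select max(last_updated) from ${self()})`)}

post_operations {
  -- After the merge, drop target rows that no longer exist in the source.
  delete from ${self()}
  where order_id not in (select order_id from ${ref("source_table")})
}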


Thanks for your response, this is helpful. I am wondering if there is any solution that doesn't involve a full table scan, since running the delete as a post-operation after the standard merge requires one?

Avoiding a full table scan for the deletion of unmatched rows in an incremental update scenario can be challenging, but there are strategies you can employ to make the process more efficient. The key is to limit the scope of the deletion to only those rows that are likely to be affected by the recent changes.

Here are some approaches:

1. Timestamp-Based Filtering:

  • Leverage timestamp columns (e.g., last_updated) to filter rows needing deletion.
  • Scan only rows updated since the last Dataform run.

Example SQL:

 
post_operations {
  delete from `target_table`
  -- placeholder: supply the previous run's timestamp (one option is sketched below)
  where last_updated < [timestamp of last successful run]
    and order_id not in (select order_id from `source_table`);
}
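
One hedged way to supply that cutoff is a small control table that every run updates when it finishes; the ops.last_run table and run_ts column below are assumptions, not something Dataform provides. Also note that in BigQuery the timestamp filter only reduces bytes scanned if the table is partitioned on that column (see point 3).

Example SQL (sketch):

-- Read the previous run's timestamp from a one-row control table,
-- then record the current run for next time (table/column names assumed).
delete from `target_table`
where last_updated < (select max(run_ts) from `ops.last_run`)
  and order_id not in (select order_id from `source_table`);

insert into `ops.last_run` (run_ts) values (current_timestamp());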

2. Incremental Deletion with Temporary Table:

  • Create a temporary table storing processed IDs (e.g., order_id) from the current run.
  • Use the temporary table to limit deletion to unprocessed rows.

Example SQL:

 
-- Create temporary table with processed IDs
create temporary table processed_ids as select order_id from `source_table`;

-- Delete using the temporary table
delete from `target_table` where order_id not in (select order_id from processed_ids);

3. Partitioning:

  • Utilize table partitioning, especially by date, to restrict deletion to relevant partitions.
  • Reduces data scanned during deletion.
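
For example, if target_table is partitioned on an order_date column (an assumption; the thread doesn't show the schema), filtering on that column in the delete lets BigQuery prune the older partitions.

Example SQL (sketch):

-- The filter on the partitioning column restricts both the delete and the
-- subquery to recent partitions; adjust the window to your draft lifecycle.
delete from `target_table`
where order_date >= date_sub(current_date(), interval 30 day)
  and order_id not in (
    select order_id from `source_table`
    where order_date >= date_sub(current_date(), interval 30 day)
  );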

4. Change Data Capture (CDC):

  • If your source system supports CDC, track deletions and only delete CDC-flagged rows in the target table.
  • Requires additional setup in the source system.
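
If such a changelog exists, the delete can target only rows the source has flagged as removed; the source_changelog table and its columns below are hypothetical.

Example SQL (sketch):

-- Delete only rows the CDC feed marks as deleted (names are assumptions).
delete from `target_table`
where order_id in (
  select order_id
  from `source_changelog`
  where change_type = 'DELETE'
    and change_timestamp >= timestamp_sub(current_timestamp(), interval 1 day)
);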

5. Clustering (in Place of Indexes):

  • BigQuery tables don't have traditional indexes; clustering on the join columns (e.g., order_id) plays a similar role.
  • Combined with partitioning, clustering can significantly reduce the cost of the delete.
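
In Dataform this is just table configuration; a minimal sketch, assuming the column names used earlier in the thread:

config {
  type: "incremental",
  bigquery: {
    partitionBy: "date(last_updated)",  // prune scans by date
    clusterBy: ["order_id"]             // co-locate rows by the join key
  }
}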

6. Batch Processing:

  • For large data volumes, consider breaking down the deletion into smaller batches.
  • Improves manageability and resource optimization.
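
A hedged sketch of batched deletion using BigQuery scripting (batch size and table names are assumptions). Each iteration still scans the tables it reads, so this mainly improves manageability of long-running deletes rather than total bytes scanned.

Example SQL (sketch):

-- Repeatedly delete up to 10,000 unmatched rows until none remain.
repeat
  delete from `target_table`
  where order_id in (
    select t.order_id
    from `target_table` t
    left join `source_table` s using (order_id)
    where s.order_id is null
    limit 10000
  );
until @@row_count = 0
end repeat;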

Remember:

  • Each method has trade-offs and may require additional setup or maintenance.
  • Choose the best approach based on your specific data, update frequency, and source system capabilities.
  • Thoroughly test any new method to ensure it meets your performance and accuracy requirements.


@ms4446 wrote:

scheduled_trigger


What would the scheduled trigger be? Is it a native action of Dataform?

The scheduled_trigger feature you are referring to is a part of Dataform's configuration capabilities.

In Dataform, you use scheduled_trigger to set up automated and recurring executions of your Dataform SQL workflows.

How to Set up a scheduled_trigger in Dataform:

  1. Within your Dataform project's definitions directory:

    • Create a .df.json file (e.g., my_dataform_project.df.json) to house Dataform configurations.
  2. Structure the scheduled_trigger within the configuration file:

    {
        "config": {
          "schedule_triggers": [  // Array to define multiple triggers
            {
              "trigger_name": "your_unique_trigger_name",
              "schedule": "your_cron_expression"
            }
          ]
        },
    
        // ... other Dataform config and operations (views, assertions, etc.)
    }
    
  • Explanation:
    • trigger_name: Provide a unique name to identify your trigger.
    • schedule: Use a cron expression to define the execution schedule. Cron expressions offer great flexibility in setting up intervals (e.g., every hour, every day at midnight, every Monday at 8 AM, etc.).

Example:

JSON
 
{
  "config":  {
    "schedule_triggers": [
      {
        "trigger_name": "hourly_data_pipeline",
        "schedule": "0 * * * *"   // Runs every hour
      }
    ]
  }
}

Important Notes:

  • Linking to Operations: The scheduled_trigger itself does not define which SQLX files/operations to execute. You need to link the scheduled trigger to specific Dataform SQLX files using the file or ref property within your Dataform definitions (see https://cloud.google.com/dataform/docs).
  • Deployment: After creating the configuration file, you use the Dataform CLI command (dataform deploy) to deploy your project and make the scheduled trigger active.
  • Cron Expression Syntax: Online resources or tools can help you construct the correct cron expression for your desired schedule. A typical cron expression has 5 fields separated by spaces, representing: (minutes, hours, day of the month, month, day of the week)

Key Advantages of using scheduled_trigger in Dataform:

  • Automation: Executes your data transformations or data cleaning tasks on a recurring schedule without manual intervention.
  • Regular Data Updates: Ensures that your target tables or views are kept up-to-date based on your defined schedule.
  • Centralized Management: Provides a clear overview and control of scheduled jobs within your Dataform project.