Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Equivalent of Insert Overwrite strategy in DataForm

Hello, i'm testing DataForm instead of DBT, and i'm missing a feature we use a lot. 

In DBT incremental mode we have a mode called incremental_overwrite, what it does is the following:

```

- create a temp table with the data you want to add incrementally (table_temp)

- set partitions_to_delete = select distinct <partition_column> from table_temp

- MERGE on <table> source USING(table_temp dest) ON False

when not matched by source and dest.<partition_column> IN (partitions_to_delete) then delete

when not matched by source then insert ...

```

It's pretty nice when the data volume is importance (making merge too slow) and no clear id exists to deduplicate. 

I'm looking to do this and struggle a bit, i saw those conversations about custom merge logic: https://www.googlecloudcommunity.com/gc/Data-Analytics/Use-custom-merge-logic-for-incremental-table-... and the option Creating a Custom Operations SQLX File. 

Would those same options apply ? or is there an easier way ? or a repo where i could check a few merge strategies already written ? 

 

Thanks

 

2 1 1,071
1 REPLY 1

While Dataform doesn't have a direct equivalent to incremental_overwrite, you can achieve a similar effect with the following approach:

Custom Merge Logic:

  1. Incremental Table: Configure your Dataform table as type: "incremental".
  2. Custom SQLX Operations: Create a SQLX operations file to house your merge logic within a post_operations block. This will handle partition deletion and insertion based on your data.

Example:

 
config { type: "incremental" } 

pre_operations {
  -- Logic to create/populate 'table_temp' if needed
}

post_operations {
  DELETE FROM `${self()}` 
  WHERE <partition_column> IN (SELECT DISTINCT <partition_column> FROM table_temp);

  INSERT INTO `${self()}`
  SELECT * FROM table_temp;
}

Important Considerations:

  • Partitioning: Partition your target table by the partition_column for optimal efficiency.
  • Error Handling: Incorporate error handling in your SQLX operations to ensure robustness.
  • Dataform Version: Be mindful of potential syntax differences between Dataform versions.