How to properly delete data from incremental table

Hi, 

I have a Dataform incremental table from which I need to delete the past 3 days of data. However, `delete from ${self()} where event_date >= event_date_checkpoint` appears to scan the entire table (the editor shows the script will process 930.32 GB). If I run the same delete statement in the BigQuery UI, the estimate is only 0 bytes.

Running the incremental table every day will be very costly if the delete alone processes ~930 GB. Is the bytes-processed estimate correct? If so, how do I optimize the delete so it only processes the relevant partitions?

Screenshot 2024-04-10 at 22.47.55.png

 Any guidance or suggestions would be greatly appreciated!

 


Hi @Yinnnnn ,

It seems the current operation isn't fully leveraging your table's partitioning, leading to unnecessary scans and increased costs.

Here's how you can optimize this:

Direct Partition Deletion:

Since your deletion criteria align with the event_date partition column, the most efficient approach is a filter built from constant expressions, which lets BigQuery prune partitions at planning time. A checkpoint value that comes from a variable or subquery can defeat pruning in the dry-run estimate, which is likely why you see the full table size. The following query directly targets the relevant partitions:

 
DELETE FROM your_table
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
  AND event_date < CURRENT_DATE();
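As a rough illustration (the table and column names below are placeholders), compare a filter BigQuery can prune against one it cannot evaluate at planning time:

```sql
-- Prunable: the filter is a constant expression, so only the
-- last three daily partitions are scanned.
DELETE FROM `project.dataset.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY);

-- Not prunable at plan time: the cutoff comes from a subquery,
-- so the dry-run estimate assumes a full-table scan.
DELETE FROM `project.dataset.events`
WHERE event_date >= (
  SELECT DATE_SUB(MAX(event_date), INTERVAL 3 DAY)
  FROM `project.dataset.checkpoints`
);
```

If your checkpoint must come from another table, compute it into a constant first (for example in a scripting `DECLARE ... DEFAULT (SELECT ...)` statement) and reference the variable in the DELETE.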

Dataform when(incremental()) Guard (Optional):

In Dataform SQLX, conditional SQL is expressed with the built-in `when()` templating function rather than a config key. Wrapping the delete in a `pre_operations` block and guarding it with `when(incremental(), ...)` ensures it only runs on incremental executions:

 
config {
    type: "incremental",
    ...
}

pre_operations {
    ${when(incremental(),
        `DELETE FROM ${self()}
         WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)`)}
}

This keeps the deletion out of full-refresh runs, where the table is recreated from scratch anyway.

Additional Recommendations:

  • Query Performance: BigQuery has no traditional indexes; partition pruning and clustering on frequently filtered columns are the main levers. Keep the partition filter a constant expression so pruning can apply.
  • Batching: For large datasets, consider breaking the deletion into smaller batches to avoid resource contention.
  • Monitoring: Use BigQuery's monitoring tools to track query performance and costs for ongoing optimization.
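If a single DELETE spanning several large partitions causes contention, one way to sketch the batching idea is a BigQuery scripting loop that deletes one daily partition per statement (table name is a placeholder):

```sql
-- Walk the last three daily partitions, oldest first.
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY);
WHILE d < CURRENT_DATE() DO
  -- Equality on the partition column keeps each DELETE to a single partition.
  DELETE FROM `project.dataset.events` WHERE event_date = d;
  SET d = DATE_ADD(d, INTERVAL 1 DAY);
END WHILE;
```

Each iteration runs as its own statement, so a failure or slot shortage affects at most one partition rather than the whole range.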