Incremental PDT creates duplicate records - Page 2

jo284 · 06-06-2024 07:05 AM

I have an incremental PDT set up with a pipeline_run_date (truncated to the day) variable used as the increment_key and no increment_offset (this is not needed for my case). The pipeline_run_date reliably increases for new data, so should meet the requirements for an increment_key.

The sql_trigger used is the count of distinct pipeline runs, so when a new run is completed, the table is triggered to build a new increment.

To my understanding, what should be happening is the date of the last pipeline run should be identified, then data from that day deleted, then any data from that date onwards is added.

However, I am finding that for previous runs, duplicate records area appearing, so the data is getting appended again at some point. I'm not sure why this is happening, when the data prior to the day of the last pipeline run shouldn't be touched.

When I first complete the new runs, the data is not duplicated, so it is presumably happening when future increments are added, after a trigger.

For extra info, here is the definition of the dimension used for the increment key:

dimension_group: pipeline_run {
type: time
timeframes: [raw, time, date, month]
sql: ${TABLE}.date_of_pipeline_run ;;
}
and the increment_key is set with the following line, which I believe takes the pipeline_run dimension above and converts it into a date format:

increment_key: "pipeline_run_date"