Incremental PDT creates duplicate records

I have an incremental PDT set up with a pipeline_run_date field (truncated to the day) used as the increment_key, and no increment_offset (not needed in my case). The pipeline_run_date reliably increases as new data arrives, so it should meet the requirements for an increment_key.
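For context, the relevant part of the derived table looks roughly like this (the SELECT itself is simplified away; only the increment parameters matter here):

view: aggregated_indices {
  derived_table: {
    sql: SELECT ... FROM source_table ;;  # simplified
    increment_key: "pipeline_run_date"
    # no increment_offset, since late-arriving data isn't a concern here
  }
}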

The sql_trigger_value used is the count of distinct pipeline runs, so when a new run completes, the table is triggered to build a new increment.
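Concretely, it is something along these lines (the run table and ID column names are illustrative):

sql_trigger_value: SELECT COUNT(DISTINCT pipeline_run_id) FROM pipeline_runs ;;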

To my understanding, what should happen is: the date of the latest existing increment key value is identified, data from that day onwards is deleted from the PDT, and then data from that date onwards is re-inserted.
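In other words, I'd expect each increment to behave roughly like the sketch below. This is conceptual only, not the literal SQL Looker generates (on BigQuery it is actually done in a single MERGE, shown further down):

-- boundary = the latest pipeline_run_date already present in the PDT
-- 1. delete rows at or after that boundary
DELETE FROM aggregated_indices WHERE pipeline_run_date >= boundary;
-- 2. re-run the source query for that range and append the results
INSERT INTO aggregated_indices
SELECT ... FROM source_query WHERE pipeline_run_date >= boundary;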

However, I am finding that duplicate records are appearing for previous runs, so the data is getting appended again at some point. I'm not sure why this is happening, since data prior to the day of the last pipeline run shouldn't be touched.

When a new run first completes, its data is not duplicated, so the duplication presumably happens when subsequent increments are built after a trigger.

For extra info, here is the definition of the dimension group used for the increment key:

dimension_group: pipeline_run {
  type: time
  timeframes: [raw, time, date, month]
  sql: ${TABLE}.date_of_pipeline_run ;;
}
and the increment_key is set with the following line, which I believe takes the date timeframe of the pipeline_run dimension group above (i.e. the value converted to a date):

increment_key: "pipeline_run_date"

Thanks for the well-crafted question! It sounds like you have a very good understanding of how this feature ought to work. 

When I create a PDT with an increment, I like to reduce the chance of unexpected behaviour by always using a field directly from the GROUP BY clause of my PDT query. In this case you are casting the "date_of_pipeline_run" field to a date. Try using "pipeline_run_raw" instead; that way your increment key will match a field in your PDT query exactly. There could be a timezone conversion or an unexpected timestamp conversion happening that you didn't foresee.
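i.e. just swapping the increment_key line to:

increment_key: "pipeline_run_raw"

The raw timeframe is passed through without any casting or timezone conversion, so it should line up exactly with the column in the generated PDT.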

An alternative would be to use an NDT (Native Derived Table) to define your PDT query instead of SQL. That way you can be more certain that the underlying query is working correctly. Just make sure you use the _date timeframe from the dimension group in your PDT definition, and it should just work. I always recommend NDTs, even though you sometimes have to build the source explore just for this purpose... that "unnecessary" exercise often illuminates the core issue at hand.
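As a rough sketch, assuming a source explore over the pipeline runs (all names here are illustrative):

view: aggregated_indices_ndt {
  derived_table: {
    explore_source: pipeline_runs {
      column: pipeline_run_date { field: pipeline_runs.pipeline_run_date }
      column: total_value { field: pipeline_runs.total_value }
    }
    increment_key: "pipeline_run_date"
  }
}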

Sometimes I also find I need to add convert_tz: no to the dimension in the PDT, so that's another thing to try. See what happens when you use the PDT in another query; I like to prevent Looker from applying timezone conversion in the SQL for a field that is stored in the database without time info.
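That would just be one extra line on the dimension group you posted:

dimension_group: pipeline_run {
  type: time
  timeframes: [raw, time, date, month]
  convert_tz: no
  sql: ${TABLE}.date_of_pipeline_run ;;
}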

Hope those ideas help you troubleshoot!

Thanks for your suggestions!

I have done some more investigation and discovered that, in general, every additional run of the pipeline produces another duplication of the existing data in the table. When I first perform a run, the data from that new run is not duplicated; after a second run, the first run's data has one set of duplicates (two of each value in total); after a third, it has two sets (three of each value in total), and so on. The table is not doubling each time; one additional copy of the data is appended at each trigger, as if the whole table is being appended again rather than just the new data.
There are, however, a couple of exceptions where one fewer duplication than usual occurred.

For info, the PDT build details in the Admin panel show the following for the last build, which ran at 2024-06-04 09:17:40 +0000:
WHEN NOT MATCHED BY SOURCE AND aggregated_indices.date_of_pipeline_run >= TIMESTAMP('2024-06-03 00:00:00') THEN DELETE
WHEN NOT MATCHED THEN INSERT [column names]
One possible cause I've just considered is that the MERGE INTO ... statement that comes before this contains a hard-coded WHERE condition:

AND TIMESTAMP_TRUNC(_PARTITIONTIME, MONTH) >= TIMESTAMP("2023-01-01")

This was added because a filter on the partition column is required (the table is configured that way in BigQuery), and the thinking was that using a date prior to when runs began would simply mean no data was excluded by the query and wouldn't interfere with the incrementing. However, perhaps what is actually happening is that all data after Jan 2023 is being appended again each time. Does that sound like it might be the issue? If so, I suppose I need this timestamp to track the latest truncated pipeline_run_date (or pipeline_run_raw) value rather than being fixed.
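If that is the cause, one option I've seen in the Looker docs is the {% incrementcondition %} Liquid block, which renders a comparison against the last increment boundary. Applying it to the partition column might satisfy BigQuery's required partition filter while only scanning new partitions; a sketch, untested on my side:

derived_table: {
  sql:
    SELECT ...
    FROM source_table
    -- should render roughly as: _PARTITIONTIME >= <last increment boundary>
    WHERE {% incrementcondition %} _PARTITIONTIME {% endincrementcondition %}
  ;;
  increment_key: "pipeline_run_date"
}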


I'm not sure though, because I would have thought the required partition filter on the BigQuery table was irrelevant to (and not needed for) the incrementing itself?
