I have a table that has field "Updated_TS" , which is the time at which the record was updated. I am setting up a Data Quality Scan with the scope "Incremental" and I have selected "Updated_TS" as timestamp column.
The table has data for past 2 years.
Now, what would happen during the initial and consecutive runs of this Data Quality Scan.
All, I am looking is to scan for the records that was updated "current_date - 1", when I run this Data Quality scan Daily at 8 AM.
Solved! Go to Solution.
DataPlex’s Data Quality Scan, when configured in incremental mode with the Updated_TS timestamp column, uses a watermark mechanism to determine which records to process. In the initial run—absent any previously stored state—the scan processes all records in the table, spanning two years of data, and then records the maximum Updated_TS value as the high watermark. This watermark serves as the baseline for subsequent scans, which will only consider records updated after this recorded timestamp.
However, this default behavior does not align with the specific goal of scanning only records updated during the previous day. In consecutive runs, the scan will include any records with an Updated_TS greater than the previous high watermark, regardless of whether those records fall within the desired “previous day” timeframe. This means that without further refinement, the scan may process data outside the intended window or even return no results if no updates exceed the existing watermark.
To precisely target records updated between 12:00 AM and 11:59 PM on the day before the scan runs, a row filter must be applied. For example, in BigQuery one might add the condition:
This row filter restricts the scan to only those records whose date part of Updated_TS matches yesterday’s date. With the filter in place, even the initial run will effectively limit processing to records from the previous day, storing the maximum timestamp from that filtered set as the high watermark. Subsequent runs will continue to use both the incremental logic and the row filter, ensuring that only the intended subset of data is evaluated each day.
It is essential to ensure that the time zones for Updated_TS, the DataPlex scan settings, and the CURRENT_DATE() function are all aligned, as discrepancies can lead to incorrect filtering. Additionally, if there is a possibility of late-arriving or backfilled data, further adjustments—such as implementing a lookback window—may be necessary.
While the default incremental scan behavior in DataPlex does not restrict scans to a specific day’s records, incorporating an appropriate row filter enables the scan to target only records updated on the previous day, thereby meeting the desired objective efficiently.
DataPlex’s Data Quality Scan, when configured in incremental mode with the Updated_TS timestamp column, uses a watermark mechanism to determine which records to process. In the initial run—absent any previously stored state—the scan processes all records in the table, spanning two years of data, and then records the maximum Updated_TS value as the high watermark. This watermark serves as the baseline for subsequent scans, which will only consider records updated after this recorded timestamp.
However, this default behavior does not align with the specific goal of scanning only records updated during the previous day. In consecutive runs, the scan will include any records with an Updated_TS greater than the previous high watermark, regardless of whether those records fall within the desired “previous day” timeframe. This means that without further refinement, the scan may process data outside the intended window or even return no results if no updates exceed the existing watermark.
To precisely target records updated between 12:00 AM and 11:59 PM on the day before the scan runs, a row filter must be applied. For example, in BigQuery one might add the condition:
This row filter restricts the scan to only those records whose date part of Updated_TS matches yesterday’s date. With the filter in place, even the initial run will effectively limit processing to records from the previous day, storing the maximum timestamp from that filtered set as the high watermark. Subsequent runs will continue to use both the incremental logic and the row filter, ensuring that only the intended subset of data is evaluated each day.
It is essential to ensure that the time zones for Updated_TS, the DataPlex scan settings, and the CURRENT_DATE() function are all aligned, as discrepancies can lead to incorrect filtering. Additionally, if there is a possibility of late-arriving or backfilled data, further adjustments—such as implementing a lookback window—may be necessary.
While the default incremental scan behavior in DataPlex does not restrict scans to a specific day’s records, incorporating an appropriate row filter enables the scan to target only records updated on the previous day, thereby meeting the desired objective efficiently.
Hi ms4446
Any inputs on the below would be highly appreciated?
we have set the below things in our scan
- scope incremental
- threshold - 0.99
- notification report
- job failure trigger - True
Scorethreshold_trigger -100%
in addition to other required parameters. we have configured the above parameters for 1000+ scans
We are seeing two alerts mails for each scans that fails. We are trying to understand how this works and how to restrict the mail alert to once per scans