Dataplex Quality scan cost optimizing

mzientkowski · 02-19-2024 04:36 AM

Hi,
I'm wondering if there's some way to configure a custom query in Dataplex's data quality scans in a cost-optimized way. For example i want do uniqueness check of id, Auto DQ rule will generate following query:

WITH
    `3f00a224-2204-4ae5-bd7b-2ffa70afc926` AS (
        SELECT
            *
        FROM
            `my-project.my_dataset.my_table`


SELECT
    *
FROM
    `3f00a224-2204-4ae5-bd7b-2ffa70afc926`
WHERE
    `id` IN (
        SELECT
            `id`
        FROM
            `3f00a224-2204-4ae5-bd7b-2ffa70afc926`
        GROUP BY
            `id`
        HAVING
            COUNT(`id`) > 1);

in my case I would like not to do "select *", instead select only id what's implicit lower costs.

SELECT
    id
FROM
    `3f00a224-2204-4ae5-bd7b-2ffa70afc926`
WHERE
    `id` IN (
        SELECT
            `id`
        FROM
            `3f00a224-2204-4ae5-bd7b-2ffa70afc926`
        GROUP BY
            `id`
        HAVING
            COUNT(`id`) > 1);

Is this achievable ?

kolban

Howdy. This is a great question. I had a go at trying to recreate. I created a table and defined a Data Quality scan. I too saw the `select *` as the query. However, when I looked at the BigQuery jobs that my project ran, I could not find any trace of its execution. What that implies to me is that the cost of running the BigQuery scan is absorbed by Dataplex. However, after considering that for a bit, I then realized that Google passes on the cost of using Dataplex to the consumer ... which implies a more efficient query should result in a lower cost.

I have created a report to have engineering examine. The internal id of the report is 325939970 (in case someone needs to refer to it). Many thanks for bringing this up.

javi4

Hi,

One year has elapsed since this post but Dataplex query generation seems to still have the this problem.

Was there some development in regards of query optimization? I see that recent SQL assertion rule type is also based on the selection of all the columns, which may greatly increase the computation from end users.

Thanks