Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Dataplex Quality Generated Query Optimization

Hi,

I am using Dataplex to execute data quality controls over my data but I am concerned about its resource consumption as the generated queries don't seem to be as optimized as possible.

When executing a Completeness control the autogenerated query:

WITH `f770f458-ec9a-407f-8228-d79bdebfa8d9` AS (SELECT * FROM `my_table` ) SELECT * FROM `f770f458-ec9a-407f-8228-d79bdebfa8d9` WHERE (`my_field` IS NULL);

 selects all the columns which may decrease greatly the performance (in my case selecting only my_field would consume 1% of the process above).

I had the same observations for the rest of the rule types, i.e. for Row Condition Check:

WITH `4ece6e96-41e0-4e30-9912-60740a25af5d` AS (SELECT * FROM `my_table` ) SELECT * FROM `4ece6e96-41e0-4e30-9912-60740a25af5d` WHERE (NOT(('my_field1' < 'my_field2') IS TRUE));

, for regex 

WITH `f770f458-ec9a-407f-8228-d79bdebfa8d9` AS (SELECT * FROM `my_table` ) SELECT * FROM `f770f458-ec9a-407f-8228-d79bdebfa8d9` WHERE (NOT((REGEXP_CONTAINS(`my_field`, r'^.{5}$')) IS TRUE));

, etc. 

I tried to search in the documentation but I did not find much information regarding the manage of this topic, it seems that there was some examination as per https://www.googlecloudcommunity.com/gc/Data-Analytics/Dataplex-Quality-scan-cost-optimizing/m-p/712..., but I'm not sure where to find further details about this.

Perhaps I am missing some query optimization configuration (unfortunately I did not find details of this topic in the release notes during the last year)? Could you please provide me more information about it? That would be extremely helpful for me.

Thanks, regards,

Javier 

 

 

0 1 201
1 REPLY 1