Hi,
I am using Dataplex to execute data quality controls over my data but I am concerned about its resource consumption as the generated queries don't seem to be as optimized as possible.
When executing a Completeness control the autogenerated query:
WITH `f770f458-ec9a-407f-8228-d79bdebfa8d9` AS (SELECT * FROM `my_table` ) SELECT * FROM `f770f458-ec9a-407f-8228-d79bdebfa8d9` WHERE (`my_field` IS NULL);
selects all the columns which may decrease greatly the performance (in my case selecting only my_field would consume 1% of the process above).
I had the same observations for the rest of the rule types, i.e. for Row Condition Check:
WITH `4ece6e96-41e0-4e30-9912-60740a25af5d` AS (SELECT * FROM `my_table` ) SELECT * FROM `4ece6e96-41e0-4e30-9912-60740a25af5d` WHERE (NOT(('my_field1' < 'my_field2') IS TRUE));
, for regex
WITH `f770f458-ec9a-407f-8228-d79bdebfa8d9` AS (SELECT * FROM `my_table` ) SELECT * FROM `f770f458-ec9a-407f-8228-d79bdebfa8d9` WHERE (NOT((REGEXP_CONTAINS(`my_field`, r'^.{5}$')) IS TRUE));
, etc.
I tried to search in the documentation but I did not find much information regarding the manage of this topic, it seems that there was some examination as per https://www.googlecloudcommunity.com/gc/Data-Analytics/Dataplex-Quality-scan-cost-optimizing/m-p/712..., but I'm not sure where to find further details about this.
Perhaps I am missing some query optimization configuration (unfortunately I did not find details of this topic in the release notes during the last year)? Could you please provide me more information about it? That would be extremely helpful for me.
Thanks, regards,
Javier
Hi javi4,
Welcome to the Google Cloud Community!
It seems that Dataplex's auto-generated data quality queries currently use the SELECT * approach, which may not always be the most efficient. Although direct modifications to these queries are not possible, you can use Custom SQL rules, where you can create your own optimized queries that only select the necessary columns.
Additionally, you may consider using views to pre-select the relevant columns. By creating a view that only selects the columns needed for the data quality checks. Enabling partitioning and clustering on your BigQuery tables could also be beneficial, as these techniques help BigQuery more efficiently scan and process your data. Before deploying to the whole dataset, consider testing these quality controls to a sample of data to estimate its cost.
It would also be helpful to regularly monitor costs and keep an eye on updates from the Dataplex release notes, as future releases may include additional optimization features.
For more detailed information about Dataplex auto data quality, you can refer to this documentation.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.