Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

How Dataplex Data Quality TableConditionExpectation works behind scenes ?

Hi im trying to set up a custom query in Dataplex Data Quality Scans like the following:

 

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
{
    "data": {
        "resource": "my-resource"
    },
    "data_quality_spec": {
"row_filter" : "dt = '2023-11-01' AND hr = '01'", "rules": [ { "description": "", "dimension": "CONSISTENCY", "name": "test", "table_condition_expectation": { "sql_expression": "(SELECT SUM(revenue) > 2000 FROM `data-production-01.dataset.table_test`)" } } ] } }
 
 

  My question is the following, how this partial sql_expression is actually converted into a full query? Is translated as a subquery ?  

Im asking because I want to create a sql_expression that checks the number of rows of the row_filter partition against a previous partition, and im not sure how dataplex is going to interpret it. 

EDIT: link to Table condition expectation: 

0 3 1,312
3 REPLIES 3

In Dataplex Data Quality enables you to define custom SQL expressions for validating data quality against specific conditions within a table. These expressions are evaluated within the context of a specified row filter, ensuring that the validation applies only to the relevant subset of data.

Contextual Evaluation and Expression Design:

  • The sql_expression should be contextually relevant to the data subset defined by the row_filter.

  • Design the sql_expression to accurately assess data quality based on the specified conditions.

  • Use appropriate aggregate functions, joins, or other SQL features as needed for your data quality checks.

Result Expectation and Flexibility:

  • The sql_expression should return a boolean result (TRUE or FALSE) indicating whether the data meets the specified quality criteria.

  • TableConditionExpectation offers flexibility for a wide range of data quality checks, from simple validations to complex comparisons.

Testing, Documentation, and Community Resources:

  • Thoroughly test your sql_expression in a separate SQL environment before using it in Dataplex.

  • Consult community forums, user guides, and other resources for additional insights and practical examples.

Leveraging TableConditionExpectation Effectively:

  • Carefully plan and test your custom SQL expressions to ensure they produce accurate and reliable results.

  • Tailor your expressions to your specific data characteristics and quality requirements.

By considering these aspects, you can effectively utilize TableConditionExpectation in Dataplex Data Quality to perform robust and customized data quality validations.

Can be used dataplex quality data scan to test if a given value, for example an uuid, is in the uuid colonne of another table ?

You can leverage Dataplex Data Quality scans to verify the presence of a specific UUID within the UUID column of another table. Here's how to approach this:

Method: Custom SQL Expression within Data Quality Rule

  • Initiate a Data Quality Scan: Start by creating a new Data Quality scan in Dataplex, targeting the dataset or table you wish to apply the validation rule to.
  • Define a Data Quality Rule: When setting up your data quality rules, you'll utilize a custom SQL expression to articulate the condition you're checking for. This is done within the framework of a TableConditionExpectation or a similar mechanism that allows for the execution of custom SQL queries.
  • Craft the SQL Expression: Your SQL expression should be designed to check for the existence of the UUID in the target table's column. An example expression might look like this:
EXISTS (SELECT 1 FROM `other_project.other_dataset.other_table` WHERE uuid_column = 'your_uuid_value') 
  • Replace other_project.other_dataset.other_table with the actual path to the table you're checking against.
  • Replace 'your_uuid_value' with the specific UUID you're verifying.

Considerations for Rule Configuration:

  • Since you're checking for the existence of a specific value, the concept of a "passing threshold" typically used for percentage-based validations does not directly apply. The rule is binary; it passes if the condition is met (TRUE if the UUID exists).

Operational Considerations:

  • Permissions: Ensure that the Dataplex service account or the account executing the Data Quality scan has the necessary permissions to access both the source and target datasets/tables.
  • Performance Optimization: For large datasets, consider performance implications. Indexing the uuid_column in the target table can improve the efficiency of the query.
  • Data Type Compatibility: Ensure that the UUID columns in both the source and target tables are compatible in terms of data type to avoid mismatches or errors during validation.

Additional Considerations:

  • Dynamic UUID Handling: If the UUID you're checking is not static, consider mechanisms to dynamically inject the UUID value into the SQL expression for each scan execution.
  • Testing and Validation: It's crucial to test your SQL expression in a separate SQL environment to ensure it behaves as expected before incorporating it into a Dataplex Data Quality scan.