Dataplex quality scan rules involving more than one table

Is there currently any way to define a quality rule in Dataplex that involves more than one table? Something like:

uuid IN (SELECT uuid FROM dataset.dimensional_table)

Solved

5 REPLIES

Unfortunately, Dataplex Data Quality rules do not natively support defining rules that involve subqueries spanning multiple tables, such as the example you've provided. Here's an overview of the limitations and some potential workarounds:

Limitations

  • Single Table Focus: Dataplex's Data Quality rules are designed primarily for operations within a single table. The sql_expression in a rule is expected to execute in the context of the specific table being scanned.

  • Subquery Scope: Although subqueries are technically supported, their use is constrained to operations that do not extend beyond the primary table targeted by the Data Quality scan.

Workarounds

  • Data Preparation (ETL / Materialized Views):

    • Pre-process your data using an ETL (Extract, Transform, Load) process to consolidate the necessary information from multiple tables into a new, denormalized table or a materialized view. This pre-processed dataset can then be scanned directly by Dataplex (see the sketch after this list).

    • Simplify Dataplex rules by using the combined data, which avoids the need for cross-table validation within a single rule.

  • External Validation + Scripting:

    • Leverage external scripting or programming environments (e.g., Python, Java) to connect to your data sources (such as BigQuery) and execute complex cross-table validation logic.

    • Log validation results into a table that Dataplex can monitor. Note that this approach does not integrate real-time quality rules within Dataplex but can be used for periodic data quality assessments.
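
As a minimal sketch of the data-preparation workaround: all project, dataset, table, and column names below (including the derived A_enriched table and the dim_uuid column) are hypothetical placeholders. The idea is to rebuild a denormalized table that carries the dimension lookup as a column, so a subsequent Dataplex scan only needs a simple single-table rule such as dim_uuid IS NOT NULL. Depending on freshness requirements, the same SELECT could instead back a view, a materialized view, or a scheduled query.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names throughout: adjust project, dataset, tables, and columns.
# Rebuild a denormalized copy of A with the dimension lookup materialized as a column.
ddl = """
CREATE OR REPLACE TABLE `project.dataset.A_enriched` AS
SELECT
    a.*,
    d.uuid AS dim_uuid  -- NULL when a.uuid has no match in the dimension table
FROM `project.dataset.A` AS a
LEFT JOIN `project.dataset.dimensional_table` AS d
    ON a.uuid = d.uuid
"""
client.query(ddl).result()  # wait for the DDL job to finish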

Important Considerations

  • Performance: The suggested workarounds, especially those involving complex queries or large datasets, may impact performance. It's important to optimize your data processing workflows to mitigate this.

  • Data Freshness: When using data preparation methods like ETL processes or materialized views, ensure they are regularly updated to align with your Dataplex scan schedules. This alignment is crucial for maintaining the accuracy of your data quality results.

Thank you for your detailed answer. I'm figuring out how to log validation results from an external execution environment, for example:

I have a table A in BigQuery and another one named A_scan_results where Dataplex's scan results are stored. With Cloud Scheduler, for example, I periodically trigger a Cloud Run job where I perform the cross-table data quality scans for table A. Should the scan results be stored in the A_scan_results table to be integrated with Dataplex?

 

Implementing cross-table data quality scans this way within the Google Cloud ecosystem is a sound approach. Below is a breakdown of how to execute it effectively, along with key considerations:

Overall Workflow

  • Cloud Scheduler: Set up recurring data quality scans using Cloud Scheduler. This automates the process, ensuring checks run at desired intervals.
  • Cloud Run Job:
    • Language Flexibility: Python is a strong choice due to its robust support for Google Cloud services, but other Cloud Run supported languages are viable.
    • Logic Implementation:
      • Connect to BigQuery using the relevant Google Cloud client library.
      • Execute cross-table data quality checks (e.g., IN subqueries, JOINs).
      • Format results according to the schema of your A_scan_results table.
      • Store formatted results in the A_scan_results table for monitoring.
  • Dataplex Integration: While Dataplex doesn't natively use external scan results, the A_scan_results table acts as a central repository. Consider custom dashboards or other governance tools for analysis.

Key Considerations

  • A_scan_results Table Schema: Carefully design the schema to effectively store results. Consider columns like:
    • timestamp
    • data_location
    • rule_type
    • rule_description
    • result
  • Error Handling: Implement robust error handling in your Cloud Run job. Logging and alerting will speed up troubleshooting.
  • Authentication and Permissions: The Cloud Run job needs appropriate BigQuery permissions (read data tables, write to A_scan_results).
  • Result Formatting: Invest in consistent formatting for compatibility with A_scan_results and to enhance the clarity of insights.

Conceptual Example

Let's assume a simple schema for A_scan_results:

 
timestamp | rule_name | table_location | result 
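
As a rough sketch (the project and dataset names are placeholders), such a table could be created once with a DDL statement like this:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder project/dataset names; one row per executed rule evaluation.
ddl = """
CREATE TABLE IF NOT EXISTS `project.dataset.A_scan_results` (
    timestamp       TIMESTAMP,
    rule_name       STRING,
    table_location  STRING,
    result          STRING
)
"""
client.query(ddl).result()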

Here's a possible Python snippet for your Cloud Run job:

 
from google.cloud import bigquery
from datetime import datetime, timezone

client = bigquery.Client()

# Count rows in A whose uuid is missing from the dimension table.
# Note: NOT IN matches nothing if the subquery returns any NULL uuid values;
# a LEFT JOIN ... IS NULL check avoids that edge case.
query = """
    SELECT COUNT(*) AS failed_count
    FROM `project.dataset.A`
    WHERE uuid NOT IN (SELECT uuid FROM `project.dataset.dimensional_table`)
"""

row = next(iter(client.query(query).result()))

if row['failed_count'] > 0:
    validation_result = {
        # insert_rows_json expects JSON-serializable values, so store the timestamp as a string
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'rule_name': 'UUID Dimensional Table Check',
        'table_location': 'project.dataset.A',
        'result': 'FAIL'
    }
    errors = client.insert_rows_json('project.dataset.A_scan_results', [validation_result])
    if errors:
        raise RuntimeError(f'Failed to write scan result: {errors}')

I'm trying to figure out how to log validation results from an external scripting or programming environment.

For example, I have a table A in BigQuery and a Dataplex quality scan that stores its results in the BigQuery table A_scan_results.

Every month I trigger a cross-table data quality scan for the same table A; this quality scan runs in a Cloud Run service or a Cloud Function. I store the results of my custom cross-table quality scan in the same BigQuery table A_scan_results, and so my results are integrated with Dataplex.

Is this a good approach?

Yes, this is a good approach! It aligns well with Dataplex and adds value to your data quality process.

Some strengths of your approach:

  • Centralized Monitoring: By aggregating the results of your external validations in the A_scan_results table, you create a consolidated dashboard for monitoring data quality. This central repository facilitates an integrated view alongside Dataplex's native Data Quality scans, enhancing oversight and analysis.

  • Trend Analysis: Storing historical validation results enables comprehensive tracking of data quality trends for Table A over time. This capability is invaluable for identifying patterns, making informed decisions, and implementing proactive data quality improvements, even for validations beyond Dataplex's built-in capabilities. A query sketch for this kind of trend reporting follows this list.

  • Flexibility and Customization: Utilizing Cloud Run or Cloud Functions allows for the execution of complex, custom validation logic. This flexibility supports the implementation of sophisticated data quality checks that might not be directly feasible within Dataplex's rule engine.

  • Automated Scheduling: Leveraging Cloud Scheduler ensures that these critical data quality checks are executed consistently and automatically on a monthly basis, removing manual overhead and ensuring regular data quality assessments.
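
As a sketch of how the centralized table supports trend analysis (assuming the A_scan_results schema from the earlier example; project and dataset names are placeholders), a reporting query run from Python might look like this:

from google.cloud import bigquery

client = bigquery.Client()

# Monthly pass/fail counts per rule for table A; names are placeholders.
query = """
    SELECT
        DATE_TRUNC(DATE(timestamp), MONTH) AS month,
        rule_name,
        COUNTIF(result = 'FAIL') AS failures,
        COUNT(*) AS total_runs
    FROM `project.dataset.A_scan_results`
    WHERE table_location = 'project.dataset.A'
    GROUP BY month, rule_name
    ORDER BY month, rule_name
"""
for row in client.query(query).result():
    print(row['month'], row['rule_name'], row['failures'], row['total_runs'])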

Considerations

  • Schema Design and Compatibility: Carefully design the A_scan_results table schema to ensure compatibility with Dataplex's data quality result interpretation. A well-thought-out schema facilitates seamless integration and optimizes the utility of the Dataplex interface for result visualization and analysis.

  • Robust Error Handling: Develop comprehensive error handling within your Cloud Run or Cloud Function implementations. Effective error management is crucial for identifying, logging, and addressing validation issues, enhancing the reliability of your data quality framework. A minimal error-handling sketch follows this list.

  • Optional Dataplex Integration: Consider leveraging Dataplex as an orchestrator for your data quality workflows. If your validation schedule can accommodate flexibility, integrating your Cloud Run job or Cloud Function with Dataplex Tasks could streamline operations, embedding your custom checks within broader data processing workflows.
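
A minimal error-handling sketch for the Cloud Run job (the query below is a placeholder standing in for your actual checks): returning a non-zero exit code marks the execution as failed, and logging.exception writes a stack trace to Cloud Logging for troubleshooting and alerting.

import logging
import sys

from google.cloud import bigquery
from google.api_core import exceptions

logging.basicConfig(level=logging.INFO)  # Cloud Run forwards stdout/stderr to Cloud Logging

def main() -> int:
    client = bigquery.Client()
    query = "SELECT COUNT(*) AS failed_count FROM `project.dataset.A`"  # placeholder check
    try:
        row = next(iter(client.query(query).result()))
        logging.info("Check completed, failed_count=%s", row["failed_count"])
        return 0
    except exceptions.GoogleAPIError:
        # Logged with a stack trace so the failure is easy to find and alert on.
        logging.exception("Data quality check failed to execute")
        return 1

if __name__ == "__main__":
    sys.exit(main())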

Tips

  • Result Granularity and Schema Support: Decide on the desired granularity for your validation results (e.g., detailed row-level issues, summary statistics) and ensure the A_scan_results schema is designed to accommodate this level of detail. A sketch of capturing row-level detail alongside the summary follows this list.

  • Authentication and Permissions: Ensure proper authentication mechanisms are in place for your Cloud Run or Cloud Function, enabling secure access to BigQuery for result logging. Typically, this involves configuring a service account with the necessary BigQuery permissions (Data Editor, Data Viewer) to interact with the A_scan_results table.
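
To illustrate the granularity point, a hedged sketch: the A_scan_result_details table and its columns (run_timestamp, rule_name, uuid) are hypothetical, and the job would write this bounded sample of offending uuids in addition to the summary FAIL row in A_scan_results, making triage easier.

from google.cloud import bigquery

client = bigquery.Client()

# Capture up to 100 offending uuids into a hypothetical detail table,
# complementing the summary row written to A_scan_results.
detail_query = """
    INSERT INTO `project.dataset.A_scan_result_details` (run_timestamp, rule_name, uuid)
    SELECT CURRENT_TIMESTAMP(), 'UUID Dimensional Table Check', a.uuid
    FROM `project.dataset.A` AS a
    LEFT JOIN `project.dataset.dimensional_table` AS d ON a.uuid = d.uuid
    WHERE d.uuid IS NULL
    LIMIT 100
"""
client.query(detail_query).result()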