
Dataplex quality scan rules involving more than one table

Is there currently any way to define a quality rule in Dataplex that involves more than one table? Something like:

uuid IN (SELECT uuid FROM dataset.dimensional_table)

Solved
1 ACCEPTED SOLUTION

Dataplex quality rules can't currently reference a second table directly, but you can implement cross-table data quality scans yourself with other Google Cloud services. Below is a breakdown of how to do that, along with key considerations:

Overall Workflow

  • Cloud Scheduler: Set up recurring data quality scans using Cloud Scheduler. This automates the process, ensuring checks run at desired intervals.
  • Cloud Run Job:
    • Language Flexibility: Python is a strong choice because of its well-supported Google Cloud client libraries, but any language Cloud Run supports is viable.
    • Logic Implementation:
      • Connect to BigQuery using the relevant Google Cloud client library.
      • Execute cross-table data quality checks (e.g., IN subqueries, JOINs).
      • Format results according to the schema of your A_scan_results table.
      • Store formatted results in the A_scan_results table for monitoring.
  • Dataplex Integration: While Dataplex doesn't natively use external scan results, the A_scan_results table acts as a central repository. Consider custom dashboards or other governance tools for analysis.
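The workflow above can be sketched as the core of a Cloud Run job. This is a minimal sketch, not a complete service: the RULES registry, the rule name, and the run_scans/build_result_row helpers are hypothetical names I'm introducing for illustration, and the table paths mirror the ones used elsewhere in this thread.

```python
from datetime import datetime, timezone

# Hypothetical rule registry: rule name -> SQL returning a failed_count column.
RULES = {
    "uuid_in_dimensional_table": """
        SELECT COUNT(*) AS failed_count
        FROM `project.dataset.A` a
        LEFT JOIN `project.dataset.dimensional_table` d USING (uuid)
        WHERE d.uuid IS NULL
    """,
}

def build_result_row(rule_name, table_location, failed_count):
    """Format one rule outcome into a row matching the A_scan_results schema."""
    return {
        # ISO-8601 string, since insert_rows_json needs JSON-serializable values.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rule_name": rule_name,
        "table_location": table_location,
        "result": "FAIL" if failed_count > 0 else "PASS",
    }

def run_scans(client, table_location="project.dataset.A"):
    """Execute each registered rule in BigQuery and return formatted rows."""
    rows = []
    for name, sql in RULES.items():
        first = next(iter(client.query(sql).result()))
        rows.append(build_result_row(name, table_location, first["failed_count"]))
    return rows
```

In the actual job you would pass a `bigquery.Client()` into `run_scans` and write the returned rows to A_scan_results; keeping the formatting in a small pure helper like `build_result_row` also makes it easy to unit-test without credentials.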

Key Considerations

  • A_scan_results Table Schema: Carefully design the schema to effectively store results. Consider columns like:
    • timestamp
    • data_location
    • rule_type
    • rule_description
    • result
  • Error Handling: Implement robust error handling in your Cloud Run job. Logging and alerting will speed up troubleshooting.
  • Authentication and Permissions: The Cloud Run job needs appropriate BigQuery permissions (read data tables, write to A_scan_results).
  • Result Formatting: Keep result rows consistently formatted so they match the A_scan_results schema and stay easy to query and visualize.
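As one illustration of the schema suggestions above, the results table could be created with a DDL statement issued through the BigQuery client. The project/dataset path is a placeholder, and partitioning by day is my assumption, not something from this thread:

```python
# Hypothetical DDL for the results table; adjust project/dataset to your own.
A_SCAN_RESULTS_DDL = """
CREATE TABLE IF NOT EXISTS `project.dataset.A_scan_results` (
  timestamp        TIMESTAMP NOT NULL,
  data_location    STRING,
  rule_type        STRING,
  rule_description STRING,
  result           STRING NOT NULL
)
PARTITION BY DATE(timestamp)
"""

# With a client available, run it once:
# from google.cloud import bigquery
# bigquery.Client().query(A_SCAN_RESULTS_DDL).result()
```

Partitioning on the scan timestamp keeps long-running monitoring cheap to query, since dashboards typically filter to a recent window.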

Conceptual Example

Let's assume a simple schema for A_scan_results:

timestamp | rule_name | table_location | result

Here's a possible Python snippet for your Cloud Run job:

 
from google.cloud import bigquery
from datetime import datetime, timezone

client = bigquery.Client()

# Example query: count rows in A whose uuid is missing from the dimension table.
query = """
    SELECT COUNT(*) AS failed_count
    FROM `project.dataset.A`
    WHERE uuid NOT IN (SELECT uuid FROM `project.dataset.dimensional_table`)
"""

row = next(iter(client.query(query).result()))

if row["failed_count"] > 0:
    validation_result = {
        # insert_rows_json needs JSON-serializable values, so format the
        # timestamp as an ISO-8601 string rather than a datetime object.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rule_name": "UUID Dimensional Table Check",
        "table_location": "project.dataset.A",
        "result": "FAIL",
    }
    errors = client.insert_rows_json("project.dataset.A_scan_results", [validation_result])
    if errors:
        raise RuntimeError(f"Failed to record scan result: {errors}")
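One caveat with the query style used above: in SQL, `NOT IN` matches no rows at all if the subquery produces even one NULL uuid, which would silently report PASS on a failing table. `NOT EXISTS` (or an anti-join) is NULL-safe. A sketch of the same check, reusing the table names from the snippet:

```python
# NOT IN yields zero matches when the subquery contains a NULL, hiding real
# failures; NOT EXISTS is the NULL-safe way to express the same rule.
null_safe_query = """
    SELECT COUNT(*) AS failed_count
    FROM `project.dataset.A` a
    WHERE NOT EXISTS (
        SELECT 1
        FROM `project.dataset.dimensional_table` d
        WHERE d.uuid = a.uuid
    )
"""
```

This string can be dropped in as a replacement for `query` in the job with no other changes.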

