
Dataplex quality scan rules involving more than one table

Is there currently any way to define a quality rule in Dataplex that involves more than one table? Something like:

uuid IN (SELECT uuid FROM dataset.dimensional_table)

Solved
1 ACCEPTED SOLUTION

Dataplex quality rules can't currently reference a second table directly, but you can implement cross-table data quality scans yourself with other Google Cloud services. Below is a breakdown of how to do that, along with key considerations:

Overall Workflow

  • Cloud Scheduler: Set up recurring data quality scans using Cloud Scheduler. This automates the process, ensuring checks run at desired intervals.
  • Cloud Run Job:
    • Language Flexibility: Python is a strong choice because of its well-supported Google Cloud client libraries, but any language Cloud Run supports is viable.
    • Logic Implementation:
      • Connect to BigQuery using the relevant Google Cloud client library.
      • Execute cross-table data quality checks (e.g., IN subqueries, JOINs).
      • Format results according to the schema of your A_scan_results table.
      • Store formatted results in the A_scan_results table for monitoring.
  • Dataplex Integration: While Dataplex doesn't natively use external scan results, the A_scan_results table acts as a central repository. Consider custom dashboards or other governance tools for analysis.
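The workflow above can be sketched as the core of a Cloud Run job. This is a minimal sketch, not a complete service: the RULES registry, the rule name, and the run_scans/build_result_row helpers are hypothetical names I'm introducing for illustration, and the table paths mirror the ones used elsewhere in this thread.

```python
from datetime import datetime, timezone

# Hypothetical rule registry: rule name -> SQL returning a failed_count column.
RULES = {
    "uuid_in_dimensional_table": """
        SELECT COUNT(*) AS failed_count
        FROM `project.dataset.A` a
        LEFT JOIN `project.dataset.dimensional_table` d USING (uuid)
        WHERE d.uuid IS NULL
    """,
}

def build_result_row(rule_name, table_location, failed_count):
    """Format one rule outcome into a row matching the A_scan_results schema."""
    return {
        # ISO-8601 string, since insert_rows_json needs JSON-serializable values.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rule_name": rule_name,
        "table_location": table_location,
        "result": "FAIL" if failed_count > 0 else "PASS",
    }

def run_scans(client, table_location="project.dataset.A"):
    """Execute each registered rule in BigQuery and return formatted rows."""
    rows = []
    for name, sql in RULES.items():
        first = next(iter(client.query(sql).result()))
        rows.append(build_result_row(name, table_location, first["failed_count"]))
    return rows
```

In the actual job you would pass a `bigquery.Client()` into `run_scans` and write the returned rows to A_scan_results; keeping the formatting in a small pure helper like `build_result_row` also makes it easy to unit-test without credentials.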

Key Considerations

  • A_scan_results Table Schema: Carefully design the schema to effectively store results. Consider columns like:
    • timestamp
    • data_location
    • rule_type
    • rule_description
    • result
  • Error Handling: Implement robust error handling in your Cloud Run job. Logging and alerting will speed up troubleshooting.
  • Authentication and Permissions: The Cloud Run job needs appropriate BigQuery permissions (read data tables, write to A_scan_results).
  • Result Formatting: Keep result rows consistently formatted so they match the A_scan_results schema and stay easy to query and visualize.
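As one illustration of the schema suggestions above, the results table could be created with a DDL statement issued through the BigQuery client. The project/dataset path is a placeholder, and partitioning by day is my assumption, not something from this thread:

```python
# Hypothetical DDL for the results table; adjust project/dataset to your own.
A_SCAN_RESULTS_DDL = """
CREATE TABLE IF NOT EXISTS `project.dataset.A_scan_results` (
  timestamp        TIMESTAMP NOT NULL,
  data_location    STRING,
  rule_type        STRING,
  rule_description STRING,
  result           STRING NOT NULL
)
PARTITION BY DATE(timestamp)
"""

# With a client available, run it once:
# from google.cloud import bigquery
# bigquery.Client().query(A_SCAN_RESULTS_DDL).result()
```

Partitioning on the scan timestamp keeps long-running monitoring cheap to query, since dashboards typically filter to a recent window.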

Conceptual Example

Let's assume a simple schema for A_scan_results:

timestamp | rule_name | table_location | result

Here's a possible Python snippet for your Cloud Run job:

 
from google.cloud import bigquery
from datetime import datetime, timezone

client = bigquery.Client()

# Example query: count rows in A whose uuid is missing from the dimension table.
query = """
    SELECT COUNT(*) AS failed_count
    FROM `project.dataset.A`
    WHERE uuid NOT IN (SELECT uuid FROM `project.dataset.dimensional_table`)
"""

row = next(iter(client.query(query).result()))

if row["failed_count"] > 0:
    validation_result = {
        # insert_rows_json needs JSON-serializable values, so format the
        # timestamp as an ISO-8601 string rather than a datetime object.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rule_name": "UUID Dimensional Table Check",
        "table_location": "project.dataset.A",
        "result": "FAIL",
    }
    errors = client.insert_rows_json("project.dataset.A_scan_results", [validation_result])
    if errors:
        raise RuntimeError(f"Failed to record scan result: {errors}")
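One caveat with the query style used above: in SQL, `NOT IN` matches no rows at all if the subquery produces even one NULL uuid, which would silently report PASS on a failing table. `NOT EXISTS` (or an anti-join) is NULL-safe. A sketch of the same check, reusing the table names from the snippet:

```python
# NOT IN yields zero matches when the subquery contains a NULL, hiding real
# failures; NOT EXISTS is the NULL-safe way to express the same rule.
null_safe_query = """
    SELECT COUNT(*) AS failed_count
    FROM `project.dataset.A` a
    WHERE NOT EXISTS (
        SELECT 1
        FROM `project.dataset.dimensional_table` d
        WHERE d.uuid = a.uuid
    )
"""
```

This string can be dropped in as a replacement for `query` in the job with no other changes.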

