is there currently any way to define a quality rule in dataplex that involves more than one table, something like
uuid IN (select uuid FROM dataset.dimensional_table)
Solved! Go to Solution.
Implementing cross-table data quality scans within the Google Cloud ecosystem is a strategic approach. Below is an enhanced breakdown of how to execute this effectively, along with key considerations:
Overall Workflow
A_scan_results
table.A_scan_results
table for monitoring.A_scan_results
table acts as a central repository. Consider custom dashboards or other governance tools for analysis.Key Considerations
A_scan_results
Table Schema: Carefully design the schema to effectively store results. Consider columns like:
timestamp
data_location
rule_type
rule_description
result
A_scan_results
).A_scan_results
and to enhance the clarity of insights.Conceptual Example
Let's assume a simple schema for A_scan_results
:
timestamp | rule_name | table_location | result
Here's a possible Python snippet for your Cloud Run job:
from google.cloud import bigquery
from datetime import datetime
client = bigquery.Client()
# Example query:
query = """
SELECT COUNT(*) AS failed_count
FROM `project.dataset.A`
WHERE uuid NOT IN (SELECT uuid FROM `project.dataset.dimensional_table`)
"""
result = client.query(query).result().to_dataframe()
if result['failed_count'][0] > 0:
validation_result = {
'timestamp': datetime.now(),
'rule_name': 'UUID Dimensional Table Check',
'table_location': 'project.dataset.A',
'result': 'FAIL'
}
client.insert_rows_json('project.dataset.A_scan_results', [validation_result])