Is there currently any way to define a quality rule in Dataplex that involves more than one table? Something like:
uuid IN (select uuid FROM dataset.dimensional_table)
Unfortunately, Dataplex Data Quality rules do not natively support defining rules that involve subqueries spanning multiple tables, such as the example you've provided. Here's an overview of the limitations and some potential workarounds:
Limitations
Single Table Focus: Dataplex's Data Quality rules are designed primarily for operations within a single table. The sql_expression in a rule is expected to execute in the context of the specific table being scanned.
Subquery Scope: Although subqueries are technically supported, their use is constrained to operations that do not extend beyond the primary table targeted by the Data Quality scan.
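To make the distinction concrete (names taken from the question), the first expression below stays within the scanned table, while the second reaches into another table and is the kind of rule the limitation refers to:

-- Stays within the scanned table: evaluated per row of the scanned table only
uuid IS NOT NULL

-- Cross-table: references a second table inside the expression
uuid IN (SELECT uuid FROM `dataset.dimensional_table`)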
Workarounds
Data Preparation (ETL / Materialized Views):
Pre-process your data with an ETL (Extract, Transform, Load) job to combine the necessary information from multiple tables into a new, denormalized table or a materialized view. This pre-processed dataset can then be scanned directly by Dataplex (a sketch follows this list).
Simplify Dataplex rules by using the combined data, which avoids the need for cross-table validation within a single rule.
External Validation + Scripting:
Leverage external scripting or programming environments (e.g., Python, Java) to connect to your data sources (such as BigQuery) and execute complex cross-table validation logic.
Log validation results into a table that Dataplex can monitor. Note that this approach does not integrate real-time quality rules within Dataplex but can be used for periodic data quality assessments.
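A minimal sketch of the data-preparation workaround, using hypothetical names (A_enriched for the combined table, uuid_in_dim for the flag column) and the tables from the question:

-- Hypothetical denormalized table, refreshed on a schedule before each Dataplex scan
CREATE OR REPLACE TABLE `project.dataset.A_enriched` AS
SELECT
  a.*,
  d.uuid IS NOT NULL AS uuid_in_dim  -- TRUE when a.uuid exists in the dimensional table
FROM `project.dataset.A` AS a
LEFT JOIN `project.dataset.dimensional_table` AS d
  ON a.uuid = d.uuid;

A Dataplex scan pointed at A_enriched could then use a plain single-table condition such as uuid_in_dim = TRUE, with no subquery needed in the rule itself.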
Important Considerations
Performance: The suggested workarounds, especially those involving complex queries or large datasets, may impact performance. It's important to optimize your data processing workflows to mitigate this.
Data Freshness: When using data preparation methods like ETL processes or materialized views, ensure they are regularly updated to align with your Dataplex scan schedules. This alignment is crucial for maintaining the accuracy of your data quality results.
*Single Table Focus* - does this limitation still persist? If so, is there any possibility of involving multiple tables in a quality rule's custom SQL in the future?
Thank you for your detailed answer. I'm figuring out how to log validation results from an external execution environment. For example:
I have a table A in BigQuery and another one named A_scan_results where Dataplex's scan results are stored. With Cloud Scheduler, for example, I periodically trigger a Cloud Run job in which I perform the cross-table data quality scans for table A. Should the scan results be stored in the A_scan_results table so that they are integrated with Dataplex?
Implementing cross-table data quality scans this way within the Google Cloud ecosystem is a sound approach. Below is a breakdown of how to execute it effectively, along with key considerations:
Overall Workflow
1. Cloud Scheduler periodically triggers your Cloud Run job (or Cloud Function).
2. The job runs the cross-table validation queries against BigQuery and writes each outcome to the A_scan_results table.
3. Dashboards and other consumers read the A_scan_results table for monitoring. The A_scan_results table acts as a central repository; consider custom dashboards or other governance tools for analysis.
Key Considerations
A_scan_results table schema: Carefully design the schema to store results effectively. Consider columns like timestamp, data_location, rule_type, rule_description, and result.
Reporting: Review the rows written to A_scan_results alongside Dataplex's native scan results (for example, in a shared dashboard) to enhance the clarity of insights.
Conceptual Example
Let's assume a simple schema for A_scan_results:
timestamp | rule_name | table_location | result
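If the results table does not exist yet, a minimal DDL along these lines could create it (the column types are assumptions based on the schema above):

-- Minimal results table; adjust types and add columns (e.g. rule_description) as needed
CREATE TABLE IF NOT EXISTS `project.dataset.A_scan_results` (
  timestamp      TIMESTAMP,
  rule_name      STRING,
  table_location STRING,
  result         STRING
);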
Here's a possible Python snippet for your Cloud Run job:
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()

# Example query: count rows in A whose uuid is missing from the dimensional table.
query = """
SELECT COUNT(*) AS failed_count
FROM `project.dataset.A`
WHERE uuid NOT IN (SELECT uuid FROM `project.dataset.dimensional_table`)
"""
# .to_dataframe() requires the pandas dependency in the Cloud Run image.
result = client.query(query).result().to_dataframe()

if result['failed_count'][0] > 0:
    validation_result = {
        # insert_rows_json expects JSON-serializable values, so send the timestamp as a string.
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'rule_name': 'UUID Dimensional Table Check',
        'table_location': 'project.dataset.A',
        'result': 'FAIL',
    }
    errors = client.insert_rows_json('project.dataset.A_scan_results', [validation_result])
    if errors:
        raise RuntimeError(f'Failed to log validation result: {errors}')
I'm trying to figure out how to log validation results from an external scripting or programming environment.
For example, I have a table A in BigQuery and a Dataplex quality scan that stores its scan results in the BigQuery table A_scan_results.
Every month I trigger a cross-table data quality scan for the same table A; this quality scan is executed in a Cloud Run service or a Cloud Function. I store the results of my custom cross-table quality scan in the same BigQuery table A_scan_results, and that way my results are integrated with Dataplex.
Is this a good approach?
Yes, this is a good approach! It aligns well with Dataplex and adds value to your data quality process.
Some strengths of your approach:
Centralized Monitoring: By aggregating the results of your external validations in the A_scan_results table, you create a consolidated dashboard for monitoring data quality. This central repository facilitates an integrated view alongside Dataplex's native Data Quality scans, enhancing oversight and analysis.
Trend Analysis: Storing historical validation results enables comprehensive tracking of data quality trends for table A over time. This capability is invaluable for identifying patterns, making informed decisions, and implementing proactive data quality improvements, even for validations beyond Dataplex's built-in capabilities (a sample query follows this list).
Flexibility and Customization: Utilizing Cloud Run or Cloud Functions allows for the execution of complex, custom validation logic. This flexibility supports the implementation of sophisticated data quality checks that might not be directly feasible within Dataplex's rule engine.
Automated Scheduling: Leveraging Cloud Scheduler ensures that these critical data quality checks are executed consistently and automatically on a monthly basis, removing manual overhead and ensuring regular data quality assessments.
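To make the trend analysis above concrete, a query along these lines (column names taken from the conceptual schema earlier in the thread) summarizes outcomes per month:

-- Hypothetical monthly failure trend over the results table
SELECT
  DATE_TRUNC(DATE(timestamp), MONTH) AS month,
  rule_name,
  COUNTIF(result = 'FAIL') AS failures,
  COUNT(*) AS total_checks
FROM `project.dataset.A_scan_results`
GROUP BY month, rule_name
ORDER BY month, rule_name;

Note that failure rates are only meaningful if PASS outcomes are logged as well; the Cloud Run snippet above writes a row only on failure.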
Considerations
Schema Design and Compatibility: Carefully design the A_scan_results table schema to ensure compatibility with Dataplex's data quality result interpretation. A well-thought-out schema facilitates seamless integration and optimizes the utility of the Dataplex interface for result visualization and analysis.
Robust Error Handling: Develop comprehensive error handling within your Cloud Run or Cloud Function implementations. Effective error management is crucial for identifying, logging, and addressing validation issues, enhancing the reliability of your data quality framework.
Optional Dataplex Integration: Consider leveraging Dataplex as an orchestrator for your data quality workflows. If your validation schedule can accommodate flexibility, integrating your Cloud Run job or Cloud Function with Dataplex Tasks could streamline operations, embedding your custom checks within broader data processing workflows.
Tips
Result Granularity and Schema Support: Decide on the desired granularity for your validation results (e.g., detailed row-level issues, summary statistics) and ensure the A_scan_results schema is designed to accommodate this level of detail.
Authentication and Permissions: Ensure proper authentication mechanisms are in place for your Cloud Run or Cloud Function, enabling secure access to BigQuery for result logging. Typically, this involves configuring a service account with the necessary BigQuery permissions (Data Editor, Data Viewer) to interact with the A_scan_results table.
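As a sketch of the permissions point, assuming the job runs as a dedicated service account (the account and dataset names here are placeholders), dataset-level access can be granted with BigQuery DCL:

-- Let the job's service account read and write tables in this dataset
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `project.dataset`
TO "serviceAccount:dq-job@project.iam.gserviceaccount.com";

The account also needs permission to run query jobs (for example roles/bigquery.jobUser at the project level), which is granted through IAM rather than DCL.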
I was going through Dataplex documentation and I found this:
under Custom rule types:
(SELECT COUNT(*) FROM `example_project.example_dataset.different-table`) < COUNT(*)
The expression can include a reference to another table, for example, to create referential integrity checks.
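If that documented behavior holds, the original uuid check could in principle be expressed directly in a rule against table A (untested; the table names are the ones used earlier in the thread):

-- As a per-row condition: each row's uuid must exist in the dimensional table
uuid IN (SELECT uuid FROM `project.dataset.dimensional_table`)

-- Or as a table-level condition in the style of the documented example
(SELECT COUNT(*) FROM `project.dataset.A`
 WHERE uuid NOT IN (SELECT uuid FROM `project.dataset.dimensional_table`)) = 0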
I have not tried this on my end.
@ms4446 👆 Could you please look into the above post and provide your inputs?