Can Dataplex read/scan assets in a different GCP Org? Specifically, BQ datasets in Org B if I've configured my lake from Org A? I would reason that I could; after all, I can create an asset for an AWS S3 source. But when I try to (interactively) create such an asset I receive 'The project of the referenced resource 'xyz' is not in the same organization as the asset.'
In Dataplex, you cannot directly scan or manage BigQuery assets located in a different GCP organization. Dataplex enforces organizational boundaries for security and isolation, which is why you received the error stating that the referenced resource is not in the same organization as the asset.
Dataplex relies on IAM for access control, and the permissions granted to its service accounts are scoped to the organization in which those accounts were created. As a result, Dataplex cannot authenticate or authorize access to BigQuery datasets in another organization using its service account.
Alternative Solutions
1. Use BigQuery Views: Create a view in a project in Org A that references the table in Org B, and grant the required role (e.g., BigQuery Data Viewer) on the dataset in Org B to access the view.
2. Use BigQuery Data Transfer Service: Schedule a copy of the Org B dataset into a project in Org A so that Dataplex scans the local copy (see the sketch after this list).
3. Data Replication via ETL Tools: Replicate the Org B data into Org A with an ETL pipeline (for example, Dataflow) and register the replicated tables in Dataplex.
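If you go with option 2, the copy can be set up with the BigQuery Data Transfer Service. Below is a minimal sketch using the Python client; the project, dataset, and schedule values are placeholders, and it assumes the destination dataset already exists in Org A and that the transfer's service account has been granted read access to the Org B dataset:
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Dataset copy via the Data Transfer Service ("cross_region_copy" data source).
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="sales_data",       # destination dataset in Org A (placeholder)
    display_name="Copy sales_data from Org B",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": "orgb_project",   # placeholder
        "source_dataset_id": "sales_data",     # placeholder
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("orga_project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")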
Example: if you need to scan a dataset orgb_project.sales_data in Org B using Dataplex in Org A, first create a view in Org A over the Org B table:
CREATE OR REPLACE VIEW `orga_project.sales_view.sales_data` AS
SELECT * FROM `orgb_project.sales_data.table_name`;
Then grant BigQuery Data Viewer to the Dataplex service account from Org A on the dataset in Org B, and register orga_project.sales_view.sales_data as an asset in Dataplex in Org A (a sketch of that last step follows below).
In summary, while Dataplex cannot directly access BigQuery assets in a different organization, you can use views or data transfer methods to work around this limitation by effectively bringing the data into the same organization as your Dataplex setup.
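For the registration step, a minimal sketch with the Dataplex Python client could look like this; the location, lake, zone, and asset names are placeholders for your own setup:
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# Point the asset at the Org A dataset that holds the cross-org view(s).
asset = dataplex_v1.Asset(
    resource_spec=dataplex_v1.Asset.ResourceSpec(
        name="projects/orga_project/datasets/sales_view",
        type_=dataplex_v1.Asset.ResourceSpec.Type.BIGQUERY_DATASET,
    ),
)

# create_asset is a long-running operation; wait for it to complete.
operation = client.create_asset(
    parent=client.zone_path("orga_project", "us-central1", "lake_name", "zone_name"),
    asset_id="sales-view-asset",
    asset=asset,
)
print(operation.result())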
Thanks for the clear and direct answer. It would be great if the Dataplex docs made the same-org aspect more obvious upfront. I read your response as 'Dataplex can read from other orgs... but only using external methods'. (1) seems the most appealing, but also comes with significant overhead in the form of creating those views and then maintaining them (i.e., understanding whether more datasets are later added to Org B. In this regard, I suppose one of the views could include Org B's various INFORMATION_SCHEMAs).
Would you propose the same approaches for orgb's blob storage?
I've also read about Integration Connectors that seem like an option.
Hello ms4446,
Please correct me if I'm wrong, but currently, Dataplex only scans physical tables within the BigQuery dataset, not views. I am specifically referring to establishing lakes/zones and adding datasets that contain views/authorized views. Do you have any recommendations regarding handling views in particular using Dataplex?
Thanks in advance!
Dataplex primarily scans physical tables within BigQuery datasets, posing a challenge when working with views. To effectively manage views within Dataplex, several strategies can be employed to bridge this gap.
Materialized Views provide a straightforward solution by converting views into precomputed tables that Dataplex can scan. These views can be scheduled for regular refreshes to ensure data remains up-to-date. For example, creating a materialized view in BigQuery is simple and ensures that the data is readily accessible for Dataplex scanning:
-- Fully qualified project.dataset.name (names are illustrative)
CREATE MATERIALIZED VIEW `orga_project.sales_data.sales_data_mv` AS
SELECT * FROM `orgb_project.sales_data.table_name`;
Intermediate Tables can store the results of queries originally intended for views. These tables can be updated regularly using scheduling tools like Cloud Scheduler or Airflow, ensuring the data remains current without the direct use of views. For instance, an Airflow job can automate the process of refreshing these tables daily:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
}

dag = DAG(
    'update_intermediate_tables',
    default_args=default_args,
    description='A simple update DAG',
    schedule_interval='@daily',
)

# Refresh the intermediate table from the source table (table names are illustrative).
update_table = BashOperator(
    task_id='update_sales_data',
    bash_command='bq query --use_legacy_sql=false "INSERT INTO orga_project.sales_data.sales_data_table SELECT * FROM orgb_project.sales_data.table_name"',
    dag=dag,
)
For views that need to be referenced without direct scanning, Custom Metadata entries can be created within Dataplex. This approach documents the structure and dependencies of the views without integrating them into the scanning workflow. Using the Dataplex metadata API, you can define these entries clearly:
from google.cloud import dataplex_v1

client = dataplex_v1.MetadataServiceClient()

# Project, location, lake, and zone names are placeholders for your Dataplex setup.
parent = client.zone_path('orga_project', 'us-central1', 'lake_name', 'zone_name')

entity = {
    "id": "sales_data_view",
    "type": "CUSTOM",
    "description": "A view representing sales data",
    "schema": {
        "fields": [
            {"name": "column1", "type": "STRING"},
            {"name": "column2", "type": "INT64"},
        ]
    }
}

response = client.create_entity(parent=parent, entity=entity)
print(response)
Dataflow for On-Demand Queries offers a dynamic method to materialize views only when needed. This approach is particularly useful for large datasets where constant updates might be impractical. Dataflow can read from views and write results to tables, making them accessible to Dataplex without the need for continuous materialization.
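As a rough illustration of that pattern, the Apache Beam (Python) sketch below runs the view's query on demand and writes the result into a physical table; the view, table, bucket, and region names are placeholders, and it assumes the destination table already exists:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="orga_project",                   # placeholder
    region="us-central1",                     # placeholder
    temp_location="gs://orga_bucket/temp",    # placeholder staging bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Run the view's query with standard SQL and read the resulting rows.
        | "ReadView" >> beam.io.ReadFromBigQuery(
            query="SELECT * FROM `orga_project.sales_view.sales_data`",
            use_standard_sql=True)
        # Write the rows into a physical table that Dataplex can scan.
        | "WriteTable" >> beam.io.WriteToBigQuery(
            "orga_project:sales_data.sales_data_table",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )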
Lastly, Scheduled Queries in BigQuery can automate the conversion of view results into tables. This method ensures that Dataplex can scan up-to-date data by periodically running queries to refresh the tables:
CREATE OR REPLACE TABLE `orga_project.sales_data.sales_data_table`
AS SELECT * FROM `orgb_project.sales_data.table_name`;
While Dataplex does not directly support scanning views, these strategies—materialized views, intermediate tables, custom metadata, on-demand materialization with Dataflow, and scheduled queries—provide robust alternatives. These approaches ensure that data from views can be effectively managed and scanned within Dataplex, aligning with organizational needs and maintaining data accessibility.