Can Dataplex read/scan assets in a different GCP Org? Specifically, BQ datasets in Org B if I've configured my lake from Org A? I would reason that I could; after all, I can create an asset for an AWS S3 source. But when I try to (interactively) create such an asset I receive 'The project of the referenced resource 'xyz' is not in the same organization as the asset.'
In Dataplex, you cannot directly scan or manage BigQuery assets located in a different GCP organization. Dataplex enforces organizational boundaries for security and isolation, which is why you received the error stating that the referenced resource is not in the same organization as the asset.
Dataplex relies on IAM for access control, and the permissions granted to its service accounts are scoped to the organization in which those accounts were created. As a result, Dataplex cannot authenticate or authorize access to BigQuery datasets in another organization using its service account.
Alternative Solutions
1. Use BigQuery Views: Create a view in a project in Org A that references the table in Org B, and grant the required role (e.g., BigQuery Data Viewer) on the dataset in Org B to access the view.
2. Use BigQuery Data Transfer Service: Schedule a copy of the Org B dataset into a project in Org A so that Dataplex scans the local copy (see the sketch after this list).
3. Data Replication via ETL Tools: Replicate the Org B data into Org A with an ETL pipeline (for example, Dataflow) and register the replicated tables in Dataplex.
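If you go with option 2, the copy can be set up with the BigQuery Data Transfer Service. Below is a minimal sketch using the Python client; the project, dataset, and schedule values are placeholders, and it assumes the destination dataset already exists in Org A and that the transfer's service account has been granted read access to the Org B dataset:
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Dataset copy via the Data Transfer Service ("cross_region_copy" data source).
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="sales_data",       # destination dataset in Org A (placeholder)
    display_name="Copy sales_data from Org B",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": "orgb_project",   # placeholder
        "source_dataset_id": "sales_data",     # placeholder
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("orga_project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")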
Example: if you need to scan a dataset orgb_project.sales_data in Org B using Dataplex in Org A, first create a view in Org A over the Org B table:
CREATE OR REPLACE VIEW `orga_project.sales_view.sales_data` AS
SELECT * FROM `orgb_project.sales_data.table_name`;
Then grant BigQuery Data Viewer to the Dataplex service account from Org A on the dataset in Org B, and register orga_project.sales_view.sales_data as an asset in Dataplex in Org A (a sketch of that last step follows below).
In summary, while Dataplex cannot directly access BigQuery assets in a different organization, you can use views or data transfer methods to work around this limitation by effectively bringing the data into the same organization as your Dataplex setup.
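For the registration step, a minimal sketch with the Dataplex Python client could look like this; the location, lake, zone, and asset names are placeholders for your own setup:
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# Point the asset at the Org A dataset that holds the cross-org view(s).
asset = dataplex_v1.Asset(
    resource_spec=dataplex_v1.Asset.ResourceSpec(
        name="projects/orga_project/datasets/sales_view",
        type_=dataplex_v1.Asset.ResourceSpec.Type.BIGQUERY_DATASET,
    ),
)

# create_asset is a long-running operation; wait for it to complete.
operation = client.create_asset(
    parent=client.zone_path("orga_project", "us-central1", "lake_name", "zone_name"),
    asset_id="sales-view-asset",
    asset=asset,
)
print(operation.result())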
Thanks for the clear and direct answer. It would be great if the Dataplex docs made the same-org aspect more obvious upfront. I read your response as 'Dataplex can read from other orgs... but only using external methods'. (1) seems the most appealing, but also comes with significant overhead in the form of creating those views and then maintaining them (i.e., understanding whether more datasets are later added to Org B. In this regard, I suppose one of the views could include Org B's various INFORMATION_SCHEMAs).
Would you propose the same approaches for orgb's blob storage?
I've also read about Integration Connectors that seem like an option.
Hello ms4446,
Please correct me if I'm wrong, but currently, Dataplex only scans physical tables within the BigQuery dataset, not views. I am specifically referring to establishing lakes/zones and adding datasets that contain views/authorized views. Do you have any recommendations regarding handling views in particular using Dataplex?
Thanks in advance!
Dataplex primarily scans physical tables within BigQuery datasets, posing a challenge when working with views. To effectively manage views within Dataplex, several strategies can be employed to bridge this gap.
Materialized Views provide a straightforward solution by converting views into precomputed tables that Dataplex can scan. These views can be scheduled for regular refreshes to ensure data remains up-to-date. For example, creating a materialized view in BigQuery is simple and ensures that the data is readily accessible for Dataplex scanning:
-- Fully qualified project.dataset.name (names are illustrative)
CREATE MATERIALIZED VIEW `orga_project.sales_data.sales_data_mv` AS
SELECT * FROM `orgb_project.sales_data.table_name`;
Intermediate Tables can store the results of queries originally intended for views. These tables can be updated regularly using scheduling tools like Cloud Scheduler or Airflow, ensuring the data remains current without the direct use of views. For instance, an Airflow job can automate the process of refreshing these tables daily:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
}

dag = DAG(
    'update_intermediate_tables',
    default_args=default_args,
    description='A simple update DAG',
    schedule_interval='@daily',
)

# Refresh the intermediate table from the source table (table names are illustrative).
update_table = BashOperator(
    task_id='update_sales_data',
    bash_command='bq query --use_legacy_sql=false "INSERT INTO orga_project.sales_data.sales_data_table SELECT * FROM orgb_project.sales_data.table_name"',
    dag=dag,
)
For views that need to be referenced without direct scanning, Custom Metadata entries can be created within Dataplex. This approach documents the structure and dependencies of the views without integrating them into the scanning workflow. Using the Dataplex metadata API, you can define these entries clearly:
from google.cloud import dataplex_v1

client = dataplex_v1.MetadataServiceClient()

# Project, location, lake, and zone names are placeholders for your Dataplex setup.
parent = client.zone_path('orga_project', 'us-central1', 'lake_name', 'zone_name')

entity = {
    "id": "sales_data_view",
    "type": "CUSTOM",
    "description": "A view representing sales data",
    "schema": {
        "fields": [
            {"name": "column1", "type": "STRING"},
            {"name": "column2", "type": "INT64"},
        ]
    }
}

response = client.create_entity(parent=parent, entity=entity)
print(response)
Dataflow for On-Demand Queries offers a dynamic method to materialize views only when needed. This approach is particularly useful for large datasets where constant updates might be impractical. Dataflow can read from views and write results to tables, making them accessible to Dataplex without the need for continuous materialization.
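As a rough illustration of that pattern, the Apache Beam (Python) sketch below runs the view's query on demand and writes the result into a physical table; the view, table, bucket, and region names are placeholders, and it assumes the destination table already exists:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="orga_project",                   # placeholder
    region="us-central1",                     # placeholder
    temp_location="gs://orga_bucket/temp",    # placeholder staging bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Run the view's query with standard SQL and read the resulting rows.
        | "ReadView" >> beam.io.ReadFromBigQuery(
            query="SELECT * FROM `orga_project.sales_view.sales_data`",
            use_standard_sql=True)
        # Write the rows into a physical table that Dataplex can scan.
        | "WriteTable" >> beam.io.WriteToBigQuery(
            "orga_project:sales_data.sales_data_table",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )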
Lastly, Scheduled Queries in BigQuery can automate the conversion of view results into tables. This method ensures that Dataplex can scan up-to-date data by periodically running queries to refresh the tables:
CREATE OR REPLACE TABLE `orga_project.sales_data.sales_data_table`
AS SELECT * FROM `orgb_project.sales_data.table_name`;
While Dataplex does not directly support scanning views, these strategies—materialized views, intermediate tables, custom metadata, on-demand materialization with Dataflow, and scheduled queries—provide robust alternatives. These approaches ensure that data from views can be effectively managed and scanned within Dataplex, aligning with organizational needs and maintaining data accessibility.