Just kicking the tires here on Dataplex to see if it is the right fit. We are interested in its governance capabilities down the road, but right now we want to see whether it offers any benefits for managing BigQuery transformations across multiple GCP projects. We were thinking that Dataplex might let us simply point to data in other projects that we have been duplicating because the source projects are in different regions. Does it add any capabilities for data connectivity, movement, or transformation, or does it just expose existing GCP functionality (i.e., point to existing Dataflow pipeline / Spark features)? Could you point to data in another project by adding it to a zone in Dataplex, and then create a new, transformed view of that data in Dataplex?
Also, the Dataplex Secure tab allows you to apply permissions onto a set of data assets that may span multiple projects, right?
Hi @steveh1 ,
Dataplex can indeed facilitate this process efficiently. By creating "assets" in Dataplex that reference data stored in BigQuery and other services, you can significantly reduce the need to duplicate data. This setup not only cuts costs but also simplifies your data management landscape.
While Dataplex itself doesn’t directly move or transform data, it integrates seamlessly with existing GCP services like Dataflow and Dataproc. This means you can manage and orchestrate your data transformation processes effectively using these tools, with Dataplex providing a robust centralized governance layer. For instance, after you create a transformed view in BigQuery using SQL, Dataplex can catalog this view, thereby enhancing governance and improving data comprehension within your organization.
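To make that concrete: once the underlying datasets are registered as assets, a transformed view is created with ordinary BigQuery SQL and then cataloged by Dataplex. Here is a minimal, stdlib-only sketch that builds such a DDL statement; every project, dataset, and table name is a hypothetical example.

```python
def build_view_ddl(view: str, source: str, where: str) -> str:
    """Build a CREATE OR REPLACE VIEW statement for a transformed view.

    `view` and `source` are fully qualified `project.dataset.table`
    identifiers; all names in this sketch are hypothetical examples.
    """
    return (
        f"CREATE OR REPLACE VIEW `{view}` AS\n"
        f"SELECT * FROM `{source}`\n"
        f"WHERE {where}"
    )

# Hypothetical cross-project case: the view lives in an analytics
# project but reads from a source dataset owned by another project.
ddl = build_view_ddl(
    view="analytics-proj.curated.orders_recent",
    source="source-proj.raw.orders",
    where="order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)",
)
print(ddl)
```

The resulting DDL could be run through the bq CLI or the BigQuery console; once the view's dataset is attached to a zone as an asset, Dataplex discovery catalogs the view alongside the source tables.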
Regarding your question about the Dataplex Secure tab: Yes, you can use it to apply permissions to a set of data assets across multiple projects. This feature enables you to manage access controls at the lake, zone, or asset level, centralizing security management and simplifying compliance with regulatory requirements.
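Under the hood, the Secure tab manages IAM policy bindings on the lake, zone, or asset resource. A sketch of the binding shape it works with (the role name is a real Dataplex role, but the member is a hypothetical example):

```python
import json

def dataplex_binding(role: str, members: list[str]) -> dict:
    """Return an IAM policy binding in the shape GCP IAM uses."""
    return {"role": role, "members": members}

# Hypothetical grant: a reader group gets viewer access at the lake
# level; Dataplex then applies equivalent access on attached assets.
binding = dataplex_binding(
    role="roles/dataplex.viewer",
    members=["group:data-readers@example.com"],
)
print(json.dumps(binding, indent=2))
```

The Secure tab drives bindings like this for you; an equivalent grant should also be possible from the CLI (something along the lines of `gcloud dataplex lakes add-iam-policy-binding`).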
Dataplex offers extensive benefits for data connectivity and governance without directly handling data transformations. It's designed to complement your existing GCP tools, helping to manage, monitor, and secure data transformations and integrations across projects.
ms4446,
Thanks for that response. Just to clarify: the Dataplex Explore common-errors documentation says that we need to make sure the user has read permissions on the underlying Cloud Storage and BigQuery assets. So it sounds like Dataplex does not actually apply permissions to the objects that are pointed to by Dataplex assets; it just creates a pointer to them that can be classified and have permissions applied. So, if you have a group of Dataplex users and you want to manage their permissions to view BigQuery tables and views using Dataplex, you would either have to give everyone access to everything in BigQuery and then limit their access within Dataplex, or you would have to manually edit their access to tables and views in BigQuery to mirror the access that you grant them in Dataplex. Is that correct?
Hi @steveh1,
Yes, you're right in your understanding of how permissions work in Dataplex in conjunction with GCS and BigQuery assets. Dataplex essentially acts as a management layer that helps you organize and govern your data, but it does not override the fundamental access control requirements of the underlying storage systems.
When you set up Dataplex assets that reference data in BigQuery or GCS, the actual data objects themselves are not altered in terms of their permissions. What this means is that even though you can apply permissions at the Dataplex level to manage who can see or interact with these assets within Dataplex, the users still need to have the appropriate permissions on the underlying BigQuery tables or GCS objects to access them.
Therefore, as you mentioned, if you have a group of Dataplex users and you want to manage their access to BigQuery tables and views, you have two primary options: grant broad access in BigQuery and restrict what users can see through Dataplex, or mirror the access you grant in Dataplex with matching grants on the underlying BigQuery tables and views.
Both approaches require careful planning to ensure that security and governance policies are adhered to across your data landscape.
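If you take the mirroring approach, the bookkeeping can be scripted: BigQuery represents dataset-level access as a list of entries, and you can merge the same members you granted in Dataplex into that list before writing it back via the BigQuery API or `bq update`. A stdlib-only sketch of the merge logic (the member emails are hypothetical):

```python
def mirror_grant(access_entries: list[dict], role: str, user_email: str) -> list[dict]:
    """Return a new access list with the grant merged in, idempotently.

    `access_entries` follows the BigQuery dataset `access` JSON shape:
    a list of dicts such as {"role": "READER", "userByEmail": "..."}.
    """
    entry = {"role": role, "userByEmail": user_email}
    if entry in access_entries:
        return list(access_entries)
    return list(access_entries) + [entry]

# Hypothetical: mirror a Dataplex viewer grant as BigQuery READER access.
current = [{"role": "OWNER", "userByEmail": "admin@example.com"}]
updated = mirror_grant(current, "READER", "analyst@example.com")
print(updated)
```

Making the merge idempotent matters here: a sync job that runs repeatedly should not keep appending duplicate entries to the dataset's access list.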
Hi,
I have a question on this. If I have granted a new user access to a particular dataset that is attached as an asset to a zone from the Dataplex layer, do I need to assign the user the same level of permissions in the BigQuery console, or will the permissions I set at the Dataplex layer be propagated to the BigQuery layer? My assumption is that, with Dataplex as a single governance layer, I can control access to all BigQuery datasets and Cloud Storage buckets from the Dataplex console without granting any additional access in the BigQuery or Cloud Storage consoles.
When using Dataplex as a governance layer for managing permissions across BigQuery datasets and GCS buckets, it is important to understand how permissions are propagated and synchronized between Dataplex and these underlying services.
Permissions Propagation
Centralized Management: Dataplex is designed to centralize the management of data assets, including setting access controls. When you configure permissions for a dataset within Dataplex, those permissions are intended to apply directly to the underlying data services, such as BigQuery.
Integrated IAM Policies: Both Dataplex and BigQuery utilize IAM (Identity and Access Management) for setting permissions. Permissions assigned in Dataplex are integrated with the IAM policies of the corresponding BigQuery dataset. As a result, permissions set in Dataplex should automatically propagate to BigQuery without the need for redundant configuration.
Practical Implementation
Synchronization Delays: Although the integration is designed to be seamless, there may be brief delays as IAM policies are updated across services. It's important to account for this in time-sensitive scenarios.
Role Alignment: Ensure that the roles assigned in Dataplex reflect the intended access levels for both Dataplex and BigQuery. Dataplex roles are specifically crafted to align with standard GCP roles, modified for a governance focus.
Verification of Permissions: After configuring permissions in Dataplex, it is advisable to verify that these permissions are accurately reflected in BigQuery. This verification step ensures that there are no discrepancies and that the data access levels are as intended.
In general, you should not need to manually set permissions in BigQuery if you have already configured them in Dataplex, as Dataplex is engineered to manage these permissions across your data assets. However, verifying the effective permissions post-configuration is a best practice that ensures compliance with your organization’s access policies and security standards. This practice not only helps in maintaining security but also ensures that governance remains consistent and effective across your data landscape.
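The verification step can itself be scripted: fetch the effective members on the BigQuery side (for example, via `bq show` or the IAM API) and diff them against what was granted in Dataplex. A minimal stdlib sketch of the comparison, with hypothetical member lists:

```python
def missing_members(granted_in_dataplex: set[str], effective_in_bigquery: set[str]) -> list[str]:
    """Return members granted in Dataplex but not yet effective in BigQuery."""
    return sorted(set(granted_in_dataplex) - set(effective_in_bigquery))

# Hypothetical member lists: one taken from the Dataplex Secure tab,
# the other from the dataset's effective IAM policy in BigQuery.
granted = {"group:data-readers@example.com", "user:analyst@example.com"}
effective = {"group:data-readers@example.com"}
print(missing_members(granted, effective))
```

If the returned list is non-empty shortly after the grant, it may just be propagation delay; if it persists, the grant did not propagate and needs investigation.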
Thank you @ms4446 for clarifying my doubt. I have a few other questions as well, as I am concentrating more on the governance part. Please help me get a clear picture.
Through the "Secure" feature of the "Manage Lakes" section in the Dataplex layer, I believe we can grant access to users at the lake/zone/asset level [i.e., down to the dataset level]. I do not see an option in this section to grant users access at the table level instead of the complete dataset level. How can we achieve that through Dataplex? Can the attribute store in the Governance section help in this regard? Is the attribute store in Dataplex similar to policy tags in BigQuery?
Dataplex offers robust data management capabilities, including detailed governance and security controls. Understanding the granularity of these controls can help you effectively manage access to your data. Here’s how Dataplex handles permissions, and the role of attribute stores in governance:
Granularity of Permissions in Dataplex
In Dataplex, permissions can be assigned at the lake, zone, or asset level. This structure typically allows you to control access to datasets as a whole rather than to specific elements within those datasets, such as tables or columns.
Managing Table-Level Permissions
While Dataplex does not provide direct table-level access control through its "Secure" feature in the "Manage Lakes" section, you can manage finer-grained permissions using BigQuery's native capabilities, such as table-level IAM grants and column-level policy tags.
Attribute Store and Its Role in Governance
The Attribute Store in Dataplex is a governance feature that helps you catalog metadata about your data assets. However, it's important to clarify its function in relation to access control:
To manage access at the table level within BigQuery datasets governed by Dataplex, you will need to utilize BigQuery's native IAM and policy tag features. Dataplex facilitates broad governance and security management at higher levels (lake, zone, asset), while finer-grained control within datasets, particularly at the table and column levels, should be configured directly in BigQuery. This approach ensures that you can maintain robust governance through Dataplex while leveraging BigQuery's capabilities for detailed access control.
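For example, table-level access that the Secure tab cannot express can be granted directly in BigQuery with its SQL GRANT statement. A small sketch that builds one (the table and principal names are hypothetical):

```python
def table_grant(role: str, table: str, principal: str) -> str:
    """Build a BigQuery GRANT statement scoped to a single table."""
    return f'GRANT `{role}` ON TABLE `{table}` TO "{principal}";'

# Hypothetical: let one analyst read a single table, not the dataset.
stmt = table_grant(
    role="roles/bigquery.dataViewer",
    table="analytics-proj.curated.orders_recent",
    principal="user:analyst@example.com",
)
print(stmt)
```

The statement can be run in the BigQuery console or via the bq CLI; Dataplex then continues to govern the dataset at the asset level while this grant narrows access within it.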
Hi @ms4446 ,
Thanks for taking the time to answer. I might be misunderstanding, but could you be thinking of tag templates instead of the Attribute Store?
As per the Google documentation below, the attribute store is more about access restrictions than metadata. So I am thinking that we can restrict or grant access to a particular table in a dataset using the attribute store, as well as at the column level. Can you please let me know if my understanding is correct?
https://cloud.google.com/dataplex/docs/attribute-store
Regards, Naveen.
Any comments here?
The Dataplex Attribute Store provides a robust mechanism for defining and managing fine-grained access controls at the table and column levels. This capability integrates with BigQuery policy tags and enhances your ability to govern data access within the Dataplex framework. Your understanding that the Attribute Store can be used for both table-level and column-level access control is correct, making it a powerful tool for centralized governance in Dataplex.
Key features:
Attribute Taxonomy: Attributes are defined in a hierarchical taxonomy, so resources inherit the behaviors of the attributes applied to them.
Resource Specifications: Attribute bindings attach attributes to specific resources, both at the table level and at the column level.
Policy Behaviors: Each attribute carries access behaviors (for example, which principals may read or write), which are enforced on the resources the attribute is bound to.
ms4446,
Thanks again for the response. I have a couple more questions on Dataplex capability. I think I know the answer to these, but I want to have your response as a second opinion:
1) Is there any way to connect an external BI tool (Power BI, Qlik, Domo, Tableau, ...) to a Dataplex asset, to query it like the BI tool could query a BigQuery table?
2) Is there a way to run a query against Dataplex assets that generates a BigQuery table or view? (Referencing the <zone>.<table> object that can span across projects in the query, rather than the <dataset>.<table> object from BigQuery.)
Connecting External BI Tools to Dataplex Assets: Dataplex itself is not designed to serve as a direct data source for BI tools like Power BI, Qlik, Domo, or Tableau. Instead, Dataplex is primarily a data management and governance platform that organizes data assets stored in BigQuery, Cloud Storage, and other supported Google Cloud services. To use data in Dataplex with a BI tool, you would typically access the data through the underlying service where the data is stored. For instance, if your data asset in Dataplex is a BigQuery dataset, you would connect your BI tool directly to this BigQuery dataset in the usual manner.
Running Queries Against Dataplex Assets to Generate BigQuery Tables or Views: As of now, Dataplex does not provide a direct way to run queries against assets that generate BigQuery tables or views under the <zone>.<table> naming convention spanning across projects. Dataplex manages and governs data but does not replace or extend the querying capabilities of BigQuery. You would need to access and query your data using standard BigQuery SQL queries directed at the specific datasets and tables within BigQuery, and not through Dataplex. If your goal is to simplify querying across multiple datasets and projects, you might consider setting up BigQuery views or scheduled queries that consolidate your data as needed, which can then be cataloged and governed through Dataplex.
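As a concrete illustration of that last point, a consolidating view can reference fully qualified `project.dataset.table` names from several projects in a single statement, and the resulting view becomes an object that Dataplex can catalog. One caveat relevant to the original question: a single BigQuery query cannot span locations, so the source tables must live in the same region. A sketch with entirely hypothetical identifiers:

```python
def union_view_ddl(view: str, sources: list[str]) -> str:
    """Build a view that UNION ALLs identically shaped tables
    from multiple (hypothetical) projects in the same location."""
    selects = "\nUNION ALL\n".join(f"SELECT * FROM `{s}`" for s in sources)
    return f"CREATE OR REPLACE VIEW `{view}` AS\n{selects}"

ddl = union_view_ddl(
    "analytics-proj.curated.orders_all",
    ["proj-a.sales.orders", "proj-b.sales.orders"],
)
print(ddl)
```

Once created, the view's dataset can be attached to a Dataplex zone as an asset so the consolidated view is discovered and governed like any other table.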
ms4446,
Thanks again for the response. To clarify, what data lands in the BigQuery datasets that are created for each Dataplex zone? Is that only metadata, or does it include transformed data that could be queried? How does that data get there? What does "curating" the data mean in Dataplex?
Hi @steveh1 ,
In Dataplex, the BigQuery datasets created for each zone primarily serve as repositories for metadata management. This metadata might include details about the data assets within the zone, such as schemas, descriptions, and metadata for governance and cataloging purposes. The primary purpose of these datasets is not to store the actual data (like the raw or transformed data contents) but to manage metadata that facilitates better data understanding and governance across your data lakes.
If transformed data is being stored in a BigQuery dataset within a Dataplex setup, it would typically involve processes set up outside of the direct capabilities of Dataplex. For instance, you might use BigQuery's data transformation tools (like SQL queries or BigQuery ML) or integrate with Dataflow or Dataproc for transformation jobs that then store their outputs in BigQuery. The integration of Dataplex is more about orchestrating and managing these processes, rather than directly executing them.
Data ends up in these BigQuery datasets primarily through Dataplex's automated discovery jobs: when you attach assets (for example, Cloud Storage buckets) to a zone, discovery scans them on a schedule and publishes table definitions and schema metadata into the zone's BigQuery dataset, so the underlying files become queryable without the data itself being copied.
Curating data in the context of Dataplex mainly means organizing assets into the appropriate zones (for example, raw versus curated), enriching them with descriptive metadata, and applying data-quality and governance checks so that the data is discoverable and trustworthy.
The BigQuery datasets associated with Dataplex zones are more focused on handling metadata, with actual data processing and transformation being managed through other tools that integrate with Dataplex. Curating data involves organizing, enhancing, and ensuring the governance of data within the platform.
ms4446 - Thanks for all your time on this & detailed replies. That really helps us understand the capabilities and applicability of Dataplex!