Dataplex Capabilities

Just kicking the tires on Dataplex to see if it is the right fit. We are interested in its governance capabilities down the road, but right now we want to know whether it offers any benefits for managing BigQuery transformations across multiple GCP projects. We were hoping that Dataplex might allow us to just point to some of the data in other projects that we have been duplicating because the source projects are in different regions. Does it add any capabilities in terms of data connectivity, movement, or transformation, or does it just expose existing GCP functionality (i.e., point to existing Dataflow pipeline / Spark features)? Could you point to data in another project by adding it to a zone in Dataplex, and then create a new transformed view of that data in Dataplex?

Also, the Dataplex Secure tab allows you to apply permissions to a set of data assets that may span multiple projects, right?

Solved
1 ACCEPTED SOLUTION

Hi @steveh1 ,

In Dataplex, the BigQuery datasets created for each zone primarily serve as repositories for metadata management. This metadata might include details about the data assets within the zone, such as schemas, descriptions, and metadata for governance and cataloging purposes. The primary purpose of these datasets is not to store the actual data (like the raw or transformed data contents) but to manage metadata that facilitates better data understanding and governance across your data lakes.

If transformed data is being stored in a BigQuery dataset within a Dataplex setup, it would typically involve processes set up outside of the direct capabilities of Dataplex. For instance, you might use BigQuery's data transformation tools (like SQL queries or BigQuery ML) or integrate with Dataflow or Dataproc for transformation jobs that then store their outputs in BigQuery. The integration of Dataplex is more about orchestrating and managing these processes, rather than directly executing them.

Data ends up in these BigQuery datasets through various means:

  • External Data Processing Tools: Tools like Dataflow, Dataproc, or external ETL tools can process data and load the results into BigQuery datasets.
  • Manual Processes: Data engineers or scientists might manually create or update datasets as part of their workflows.
  • Automated Workflows: Scheduled scripts or queries, as part of a broader data pipeline, might populate or update these datasets periodically.

Curating data in the context of Dataplex involves a few key activities:

  • Cataloging Data Assets: Organizing and classifying data assets within lakes and zones for easier access and management. This includes assigning metadata, tags, and descriptions to improve discoverability and usability.
  • Managing Data Quality: Implementing checks and balances to ensure data integrity and consistency across different data assets. This might involve setting up data quality rules, validations, and monitoring to maintain high data standards.
  • Data Lineage and Tracking: Keeping track of where data comes from, how it is processed, and where it moves over time. This is crucial for compliance, troubleshooting, and optimizing data workflows.
  • Applying Governance Policies: Implementing and enforcing data governance policies that ensure data security, privacy, and compliance with regulatory requirements. This includes managing access permissions, audit logs, and compliance checks.

The BigQuery datasets associated with Dataplex zones are more focused on handling metadata, with actual data processing and transformation being managed through other tools that integrate with Dataplex. Curating data involves organizing, enhancing, and ensuring the governance of data within the platform.
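To see this metadata-only behavior for yourself, you can inspect the BigQuery dataset that Dataplex discovery creates for a zone. A rough sketch (the project and zone names below are placeholders for your own):

```shell
# Dataplex discovery publishes metadata for each zone into a BigQuery
# dataset named after the zone. Listing it shows the discovered entities;
# showing an entity returns schema/description metadata, not new data copies.
bq ls my_project:my_curated_zone
bq show --format=prettyjson my_project:my_curated_zone.some_discovered_table
```

If the underlying asset is a Cloud Storage bucket, the discovered entities typically surface as external tables, so querying them still reads from the original storage rather than from a duplicated copy.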


14 REPLIES

Hi @steveh1 ,

Dataplex can indeed facilitate this process efficiently. By creating "assets" in Dataplex that reference data stored in BigQuery and other services, you can significantly reduce the need to duplicate data. This setup not only cuts costs but also simplifies your data management landscape.

While Dataplex itself doesn’t directly move or transform data, it integrates seamlessly with existing GCP services like Dataflow and Dataproc. This means you can manage and orchestrate your data transformation processes effectively using these tools, with Dataplex providing a robust centralized governance layer. For instance, after you create a transformed view in BigQuery using SQL, Dataplex can catalog this view, thereby enhancing governance and improving data comprehension within your organization.

Regarding your question about the Dataplex Secure tab: Yes, you can use it to apply permissions to a set of data assets across multiple projects. This feature enables you to manage access controls at the lake, zone, or asset level, centralizing security management and simplifying compliance with regulatory requirements.

Dataplex offers extensive benefits for data connectivity and governance without directly handling data transformations. It's designed to complement your existing GCP tools, helping to manage, monitor, and secure data transformations and integrations across projects.
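As a concrete sketch of registering another project's data as an asset (all lake, zone, asset, and project names below are placeholders, and the exact flags should be checked against the current `gcloud dataplex` reference):

```shell
# Register an existing BigQuery dataset from a source project as a
# Dataplex asset in a governance project, instead of copying the data.
gcloud dataplex assets create sales-data \
  --project=governance-project \
  --location=us-central1 \
  --lake=analytics-lake \
  --zone=curated-zone \
  --resource-type=BIGQUERY_DATASET \
  --resource-name=projects/source-project/datasets/sales \
  --discovery-enabled
```

Note that lakes and zones are regional, so colocation requirements between the zone and the attached resource still need to be checked for your specific cross-region layout.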

 

ms4446,

Thanks for that response. Just to clarify: the Dataplex documentation on common Explore errors says that we need to make sure the user has read permissions on the underlying Cloud Storage and BigQuery assets. So, it sounds like Dataplex does not actually apply permissions to the objects that are pointed to by Dataplex assets. It just makes a pointer to them that can be classified and have permissions applied. So, if you have a group of Dataplex users and you want to manage their permissions to view BigQuery tables and views using Dataplex, then you would either have to give everyone access to everything in BigQuery and then limit their access within Dataplex, or manually edit their access to tables and views in BigQuery to mirror the access that you grant them in Dataplex. Is that correct?

Hi @steveh1,

Yes, you're right in your understanding of how permissions work in Dataplex in conjunction with GCS and BigQuery assets. Dataplex essentially acts as a management layer that helps you organize and govern your data, but it does not override the fundamental access control requirements of the underlying storage systems.

When you set up Dataplex assets that reference data in BigQuery or GCS, the actual data objects themselves are not altered in terms of their permissions. What this means is that even though you can apply permissions at the Dataplex level to manage who can see or interact with these assets within Dataplex, the users still need to have the appropriate permissions on the underlying BigQuery tables or GCS objects to access them.

Therefore, as you mentioned, if you have a group of Dataplex users and you want to manage their access to BigQuery tables and views, you have two primary options:

  1. Grant Broad Access: Give everyone broad access to the data in BigQuery, and then use Dataplex to refine and restrict access more granularly within the scope of what Dataplex manages.
  2. Mirror Permissions Manually: Manually adjust access in BigQuery to match the access controls you set up in Dataplex, ensuring that only the appropriate users have the necessary permissions both in Dataplex and in BigQuery itself.

Both approaches require careful planning to ensure that security and governance policies are adhered to across your data landscape.
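The two options above can be sketched in CLI form (principals, datasets, and lake/zone names are all hypothetical, and the `gcloud dataplex` binding commands should be verified against the current CLI reference):

```shell
# Option 1: broad BigQuery access, refined through Dataplex.
# Grant the group read on the whole dataset via BigQuery's GRANT DDL...
bq query --use_legacy_sql=false \
  'GRANT `roles/bigquery.dataViewer` ON SCHEMA `my-project.sales` TO "group:analysts@example.com"'

# ...then scope what they see in Dataplex with a zone-level binding.
gcloud dataplex zones add-iam-policy-binding curated-zone \
  --location=us-central1 --lake=analytics-lake \
  --member=group:analysts@example.com --role=roles/dataplex.dataReader

# Option 2: mirror permissions manually by repeating narrower GRANTs
# per dataset/table in BigQuery to match each Dataplex-side grant.
```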

Hi,

I have a question on this. If I grant a new user access to a particular dataset that is attached as an asset to a zone at the Dataplex layer, do I need to assign the user the same level of permissions in the BigQuery console, or will the permissions I set at the Dataplex layer be propagated to the BigQuery layer? My assumption is that, with Dataplex as a single governance layer, I can control access to all BigQuery datasets and Cloud Storage buckets from the Dataplex console without granting any additional access in the BigQuery or Cloud Storage consoles.

When using Dataplex as a governance layer for managing permissions across BigQuery datasets and GCS buckets, it is important to understand how permissions are propagated and synchronized between Dataplex and these underlying services.

Permissions Propagation

  1. Centralized Management: Dataplex is designed to centralize the management of data assets, including setting access controls. When you configure permissions for a dataset within Dataplex, those permissions are intended to apply directly to the underlying data services, such as BigQuery.

  2. Integrated IAM Policies: Both Dataplex and BigQuery use IAM (Identity and Access Management) for setting permissions. Permissions assigned in Dataplex are integrated with the IAM policies of the corresponding BigQuery dataset. As a result, permissions set in Dataplex should automatically propagate to BigQuery without the need for redundant configuration.

Practical Implementation

  • Synchronization Delays: Although the integration is designed to be seamless, there may be brief delays as IAM policies are updated across services. It's important to account for this in time-sensitive scenarios.

  • Role Alignment: Ensure that the roles assigned in Dataplex reflect the intended access levels for both Dataplex and BigQuery. Dataplex roles are specifically crafted to align with standard GCP roles, modified for a governance focus.

  • Verification of Permissions: After configuring permissions in Dataplex, it is advisable to verify that these permissions are accurately reflected in BigQuery. This verification step ensures that there are no discrepancies and that the data access levels are as intended.

In general, you should not need to manually set permissions in BigQuery if you have already configured them in Dataplex, as Dataplex is engineered to manage these permissions across your data assets. However, verifying the effective permissions post-configuration is a best practice that ensures compliance with your organization’s access policies and security standards. This practice not only helps in maintaining security but also ensures that governance remains consistent and effective across your data landscape.
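A sketch of the grant-then-verify workflow described above (names are placeholders; `gcloud dataplex assets add-iam-policy-binding` should be confirmed against the current CLI reference):

```shell
# Grant read access on a single Dataplex asset...
gcloud dataplex assets add-iam-policy-binding sales-data \
  --location=us-central1 --lake=analytics-lake --zone=curated-zone \
  --member=user:naveen@example.com --role=roles/dataplex.dataReader

# ...then, after allowing for propagation delay, verify the effective
# access entries on the underlying BigQuery dataset.
bq show --format=prettyjson my-project:sales
```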

Thank you @ms4446 for clarifying my doubt. I have a few other questions as well, as I am concentrating more on the governance part. Please help me get a clear picture.

Through the "Secure" feature of the "Manage Lakes" section in the Dataplex layer, I believe we can grant users access at the lake, zone, or asset level (i.e., down to the dataset level). I do not see an option in this section to grant users access at the table level instead of the complete dataset level. How can we achieve that through Dataplex? Can the attribute store under Governance help in this regard? Is the attribute store in Dataplex similar to policy tags in BigQuery?

Dataplex offers robust data management capabilities, including detailed governance and security controls. Understanding the granularity of these controls can help you effectively manage access to your data. Here’s how Dataplex handles permissions, and the role of attribute stores in governance:

Granularity of Permissions in Dataplex

In Dataplex, permissions can be assigned at the lake, zone, or asset level. This structure typically allows you to control access to datasets as a whole rather than to specific elements within those datasets, such as tables or columns:

  • Lake-Level Access: Controls access to all zones and assets within the lake.
  • Zone-Level Access: Governs access to assets grouped within a particular zone.
  • Asset-Level Access: Manages access to specific assets, such as a BigQuery dataset or a GCS bucket.

Managing Table-Level Permissions

While Dataplex does not provide direct table-level access control through its "secure" feature in the "Manage Lakes" section, you can manage finer-grained permissions using BigQuery’s native capabilities:

  • BigQuery IAM Policies: You can set IAM policies directly on BigQuery tables to manage access at the table level. This approach involves configuring permissions separately in BigQuery, alongside the broader permissions set in Dataplex.
  • BigQuery Policy Tags (Data Catalog Tags): For more granular control, such as column-level security, you can use BigQuery policy tags. These tags can be assigned to specific columns within a table and linked with access policies to control who can view or query specific data.
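The table-level IAM route can be sketched with the `bq` tool (table and principal names are hypothetical):

```shell
# Grant a single user read access to one table only, via BigQuery's
# native table-level IAM. This is independent of any Dataplex bindings.
bq add-iam-policy-binding \
  --member=user:naveen@example.com \
  --role=roles/bigquery.dataViewer \
  my-project:sales.orders
```

Column-level restrictions would instead go through policy tags in a Data Catalog taxonomy attached to the relevant columns.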

Attribute Store and Its Role in Governance

The Attribute Store in Dataplex is a governance feature that helps you catalog metadata about your data assets. However, it's important to clarify its function in relation to access control:

  • Functionality: The Attribute Store is designed to store metadata and annotations about data assets, which can include descriptive tags, ownership information, and classification labels. It is used primarily for data discovery, search, and lineage tracking, rather than for direct access control.
  • Comparison to BigQuery Policy Tags: Unlike BigQuery policy tags, which are directly used to enforce access controls at the dataset, table, or column level, the Attribute Store in Dataplex does not directly influence access policies. Instead, it enhances the metadata management and governance capabilities of Dataplex.

To manage access at the table level within BigQuery datasets governed by Dataplex, you will need to utilize BigQuery's native IAM and policy tag features. Dataplex facilitates broad governance and security management at higher levels (lake, zone, asset), while finer-grained control within datasets, particularly at the table and column levels, should be configured directly in BigQuery. This approach ensures that you can maintain robust governance through Dataplex while leveraging BigQuery's capabilities for detailed access control.

Hi @ms4446 ,

Thanks for taking the time to answer. I might be misunderstanding, but could you be thinking of tag templates instead of the Attribute Store?

As per the Google documentation below, the attribute store is about access restrictions rather than metadata. So I am thinking we can restrict or grant access to a particular table in a dataset using the attribute store, as well as at the column level. Can you please let me know if my understanding is correct?

https://cloud.google.com/dataplex/docs/attribute-store

Regards, Naveen.

ms4446,

Thanks again for the response. I have a couple more questions on Dataplex capabilities. I think I know the answer to these, but I would like your response as a second opinion:
1) Is there any way to connect an external BI tool (Power BI, Qlik, Domo, Tableau, ...) to a Dataplex asset, to query it the way the BI tool could query a BigQuery table?
2) Is there a way to run a query against Dataplex assets that generates a BigQuery table or view? (Referencing the <zone>.<table> object that can span across projects in the query, rather than the <dataset>.<table> object from BigQuery.)

  1. Connecting External BI Tools to Dataplex Assets: Dataplex itself is not designed to serve as a direct data source for BI tools like Power BI, Qlik, Domo, or Tableau. Instead, Dataplex is primarily a data management and governance platform that organizes data assets stored in BigQuery, Cloud Storage, and other supported Google Cloud services. To use data in Dataplex with a BI tool, you would typically access the data through the underlying service where the data is stored. For instance, if your data asset in Dataplex is a BigQuery dataset, you would connect your BI tool directly to this BigQuery dataset in the usual manner.

  2. Running Queries Against Dataplex Assets to Generate BigQuery Tables or Views: As of now, Dataplex does not provide a direct way to run queries against assets that generate BigQuery tables or views under the <zone>.<table> naming convention spanning across projects. Dataplex manages and governs data but does not replace or extend the querying capabilities of BigQuery. You would need to access and query your data using standard BigQuery SQL queries directed at the specific datasets and tables within BigQuery, and not through Dataplex. If your goal is to simplify querying across multiple datasets and projects, you might consider setting up BigQuery views or scheduled queries that consolidate your data as needed, which can then be cataloged and governed through Dataplex.
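The consolidation approach suggested above can be sketched as follows (project, dataset, and table names are placeholders; the view assumes compatible schemas across sources):

```shell
# Build one cross-project view with standard BigQuery SQL. BI tools then
# connect to the view's dataset like any other BigQuery source, and
# Dataplex can catalog the view once discovery runs over the asset.
bq query --use_legacy_sql=false '
CREATE OR REPLACE VIEW `reporting-project.marts.orders_unified` AS
SELECT * FROM `source-project-a.sales.orders`
UNION ALL
SELECT * FROM `source-project-b.sales.orders`'
```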

ms4446,

Thanks again for the response.  To clarify, what data lands in the BigQuery datasets that are created for each Dataplex zone?  Is that only metadata, or does it include transformed data that could be queried?  How does that data get there?  What does "curating" the data mean in Dataplex?


ms4446 - Thanks for all your time on this & detailed replies.  That really helps us understand the capabilities and applicability of Dataplex!

@steveh1 Hot off the press, this blog gives a good overview of Dataplex.