Dataplex Capabilities

Just kicking the tires here on Dataplex to see if it is the right fit. We are interested in its governance capabilities for the future, but right now we want to see if it offers any benefits for managing BigQuery transformations across multiple GCP projects. We were thinking that Dataplex might let us simply point to some of the data in other projects that we have been duplicating because the source projects are in different regions. Does it add any capabilities for data connectivity, movement, or transformation, or does it just expose existing GCP functionality (e.g., pointing to existing Dataflow pipeline / Spark features)? Could you point to data in another project by adding it to a zone in Dataplex, and then create a new transformed view of the data in Dataplex?

Also, the Dataplex Secure tab allows you to apply permissions to a set of data assets that may span multiple projects, right?


8 REPLIES

Hi @steveh1 ,

Dataplex can indeed facilitate this process efficiently. By creating "assets" in Dataplex that reference data stored in BigQuery and other services, you can significantly reduce the need to duplicate data. This setup not only cuts costs but also simplifies your data management landscape.

While Dataplex itself doesn’t directly move or transform data, it integrates seamlessly with existing GCP services like Dataflow and Dataproc. This means you can manage and orchestrate your data transformation processes effectively using these tools, with Dataplex providing a robust centralized governance layer. For instance, after you create a transformed view in BigQuery using SQL, Dataplex can catalog this view, thereby enhancing governance and improving data comprehension within your organization.
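For example, the cross-project transformed view described above is just standard BigQuery SQL referencing fully qualified `project.dataset.table` names. A minimal sketch that builds such a DDL statement (all project, dataset, table, and column names here are hypothetical placeholders):

```python
# Sketch: build BigQuery DDL for a view in one project that selects from
# a table in another project. All identifiers below (analytics-proj,
# source-proj, customer_id, etc.) are hypothetical placeholders.

def cross_project_view_ddl(view, source_table):
    """Return CREATE VIEW DDL whose body reads a fully qualified
    `project.dataset.table` that lives in a different project."""
    return (
        f"CREATE OR REPLACE VIEW `{view}` AS\n"
        f"SELECT customer_id, SUM(amount) AS total_amount\n"
        f"FROM `{source_table}`\n"
        f"GROUP BY customer_id"
    )

ddl = cross_project_view_ddl(
    view="analytics-proj.curated.customer_totals",
    source_table="source-proj.raw.orders",
)
print(ddl)
```

You would run this DDL through the BigQuery console, the `bq` CLI, or a client library; Dataplex would then discover and catalog the resulting view rather than execute the transformation itself. One caveat relevant to the region question: a BigQuery query cannot reference datasets in different regions, so pointing instead of duplicating only works when the datasets involved are in the same region (or a compatible multi-region).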

Regarding your question about the Dataplex Secure tab: Yes, you can use it to apply permissions to a set of data assets across multiple projects. This feature enables you to manage access controls at the lake, zone, or asset level, centralizing security management and simplifying compliance with regulatory requirements.
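To make that concrete, what the Secure tab manages under the hood are IAM bindings on the lake, zone, or asset resource. A sketch of the binding shape involved (the group address and lake path are hypothetical; `roles/dataplex.dataReader` is one of the Dataplex data roles used for this):

```python
# Sketch of the IAM-policy-style binding that the Dataplex Secure tab
# manages for a lake/zone/asset. The group and resource path below are
# hypothetical; roles/dataplex.dataReader grants read access to the data
# behind the assets.

def reader_binding(resource, members):
    """Build an IAM-policy-style binding dict for a Dataplex resource."""
    return {
        "resource": resource,
        "bindings": [
            {
                "role": "roles/dataplex.dataReader",
                "members": sorted(members),
            }
        ],
    }

policy = reader_binding(
    "projects/governance-proj/locations/us-central1/lakes/sales-lake",
    ["group:data-analysts@example.com"],
)
print(policy["bindings"][0]["role"])
```

Granting at the lake level cascades to the zones and assets beneath it, which is what centralizes security management even when the attached assets live in different projects.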

Dataplex offers extensive benefits for data connectivity and governance without directly handling data transformations. It's designed to complement your existing GCP tools, helping to manage, monitor, and secure data transformations and integrations across projects.

ms4446,

Thanks for that response. Just to clarify, the Dataplex documentation on common Explore errors says that we need to make sure the user has read permissions on the underlying Cloud Storage and BigQuery assets. So it sounds like Dataplex does not actually apply permissions to the objects that its assets point to; it just creates a pointer to them that can be classified and have permissions applied. So, if you have a group of Dataplex users and you want to manage their permissions to view BigQuery tables and views using Dataplex, then you would either have to give everyone access to everything in BigQuery and then limit their access within Dataplex, or you would have to manually edit their access to tables and views in BigQuery to mirror the access that you grant them in Dataplex. Is that correct?

Hi @steveh1,

Yes, you're right in your understanding of how permissions work in Dataplex in conjunction with GCS and BigQuery assets. Dataplex essentially acts as a management layer that helps you organize and govern your data, but it does not override the fundamental access control requirements of the underlying storage systems.

When you set up Dataplex assets that reference data in BigQuery or GCS, the actual data objects themselves are not altered in terms of their permissions. What this means is that even though you can apply permissions at the Dataplex level to manage who can see or interact with these assets within Dataplex, the users still need to have the appropriate permissions on the underlying BigQuery tables or GCS objects to access them.

Therefore, as you mentioned, if you have a group of Dataplex users and you want to manage their access to BigQuery tables and views, you have two primary options:

  1. Grant Broad Access: Give everyone broad access to the data in BigQuery, and then use Dataplex to refine and restrict access more granularly within the scope of what Dataplex manages.
  2. Mirror Permissions Manually: Manually adjust access in BigQuery to match the access controls you set up in Dataplex, ensuring that only the appropriate users have the necessary permissions both in Dataplex and in BigQuery itself.

Both approaches require careful planning to ensure that security and governance policies are adhered to across your data landscape.
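Option 2 above amounts to keeping two sets of grants in sync. A minimal sketch of the reconciliation logic (the users and table names are hypothetical, and a real tool would read and apply the resulting changes through the BigQuery IAM APIs or gcloud rather than operate on plain dicts):

```python
# Sketch: compute which BigQuery grants must be added or revoked so that
# BigQuery access mirrors what was granted in Dataplex. All names are
# hypothetical; applying the changes is left to real IAM tooling.

def reconcile(dataplex_grants, bigquery_grants):
    """Each argument maps user -> set of tables the user may read.
    Returns (to_add, to_revoke) as {user: set_of_tables} dicts."""
    users = set(dataplex_grants) | set(bigquery_grants)
    to_add, to_revoke = {}, {}
    for user in users:
        wanted = dataplex_grants.get(user, set())
        actual = bigquery_grants.get(user, set())
        if wanted - actual:
            to_add[user] = wanted - actual
        if actual - wanted:
            to_revoke[user] = actual - wanted
    return to_add, to_revoke

add, revoke = reconcile(
    dataplex_grants={"alice": {"sales.orders", "sales.customers"}},
    bigquery_grants={"alice": {"sales.orders"}, "bob": {"sales.orders"}},
)
print(add)     # alice is missing a BigQuery grant for sales.customers
print(revoke)  # bob has a BigQuery grant with no Dataplex counterpart
```

Either way, the key point stands: the authoritative decision about access happens in two places, so some process (manual or automated) has to keep them consistent.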

ms4446,

Thanks again for the response. I have a couple more questions on Dataplex capability. I think I know the answer to these, but I want to have your response as a second opinion:
1) Is there any way to connect an external BI tool (Power BI, Qlik, Domo, Tableau, ...) to a Dataplex asset, to query it like the BI tool could query a BigQuery table?
2) Is there a way to run a query against Dataplex assets that generates a BigQuery table or view?  (Referencing the <zone>.<table> object that can span across projects in the query, rather than the <dataset>.<table> object from BigQuery.)

  1. Connecting External BI Tools to Dataplex Assets: Dataplex itself is not designed to serve as a direct data source for BI tools like Power BI, Qlik, Domo, or Tableau. Instead, Dataplex is primarily a data management and governance platform that organizes data assets stored in BigQuery, Cloud Storage, and other supported Google Cloud services. To use data in Dataplex with a BI tool, you would typically access the data through the underlying service where the data is stored. For instance, if your data asset in Dataplex is a BigQuery dataset, you would connect your BI tool directly to this BigQuery dataset in the usual manner.

  2. Running Queries Against Dataplex Assets to Generate BigQuery Tables or Views: As of now, Dataplex does not provide a direct way to run queries against assets that generate BigQuery tables or views under the <zone>.<table> naming convention spanning across projects. Dataplex manages and governs data but does not replace or extend the querying capabilities of BigQuery. You would need to access and query your data using standard BigQuery SQL queries directed at the specific datasets and tables within BigQuery, and not through Dataplex. If your goal is to simplify querying across multiple datasets and projects, you might consider setting up BigQuery views or scheduled queries that consolidate your data as needed, which can then be cataloged and governed through Dataplex.
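The consolidation approach mentioned in point 2 can be as simple as one BigQuery view that unions the fully qualified tables from each project, which Dataplex can then catalog. A sketch that generates such a statement (project and table names are hypothetical, and all referenced datasets must be in the same region for the query to run):

```python
# Sketch: generate SQL for a single BigQuery view that consolidates the
# same table duplicated across several projects. Identifiers are
# hypothetical placeholders.

def union_view_sql(view, tables):
    """Return CREATE VIEW DDL that UNION ALLs the given fully qualified
    `project.dataset.table` names."""
    selects = [f"SELECT * FROM `{t}`" for t in tables]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW `{view}` AS\n{body}"

sql = union_view_sql(
    "hub-proj.consolidated.orders",
    ["proj-a.sales.orders", "proj-b.sales.orders"],
)
print(sql)
```

A BI tool would then connect to the consolidated BigQuery view in the usual way, not to Dataplex, which matches point 1 above.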

ms4446,

Thanks again for the response.  To clarify, what data lands in the BigQuery datasets that are created for each Dataplex zone?  Is that only metadata, or does it include transformed data that could be queried?  How does that data get there?  What does "curating" the data mean in Dataplex?

Hi @steveh1 ,

In Dataplex, the BigQuery datasets created for each zone primarily serve as repositories for metadata. That metadata includes details about the data assets within the zone, such as schemas, descriptions, and tags used for governance and cataloging. The primary purpose of these datasets is not to store the actual data contents (raw or transformed) but to manage metadata that facilitates better data understanding and governance across your data lakes.

If transformed data is being stored in a BigQuery dataset within a Dataplex setup, it would typically involve processes set up outside of Dataplex's direct capabilities. For instance, you might use BigQuery's own transformation tools (such as SQL queries or BigQuery ML) or integrate Dataflow or Dataproc transformation jobs that store their outputs in BigQuery. Dataplex's role is to orchestrate and manage these processes rather than execute them directly.

Data ends up in these BigQuery datasets through various means:

  • External Data Processing Tools: Tools like Dataflow, Dataproc, or external ETL tools can process data and load the results into BigQuery datasets.
  • Manual Processes: Data engineers or scientists might manually create or update datasets as part of their workflows.
  • Automated Workflows: Scheduled scripts or queries, as part of a broader data pipeline, might populate or update these datasets periodically.

Curating data in the context of Dataplex involves a few key activities:

  • Cataloging Data Assets: Organizing and classifying data assets within lakes and zones for easier access and management. This includes assigning metadata, tags, and descriptions to improve discoverability and usability.
  • Managing Data Quality: Implementing checks and balances to ensure data integrity and consistency across different data assets. This might involve setting up data quality rules, validations, and monitoring to maintain high data standards.
  • Data Lineage and Tracking: Keeping track of where data comes from, how it is processed, and where it moves over time. This is crucial for compliance, troubleshooting, and optimizing data workflows.
  • Applying Governance Policies: Implementing and enforcing data governance policies that ensure data security, privacy, and compliance with regulatory requirements. This includes managing access permissions, audit logs, and compliance checks.

The BigQuery datasets associated with Dataplex zones are more focused on handling metadata, with actual data processing and transformation being managed through other tools that integrate with Dataplex. Curating data involves organizing, enhancing, and ensuring the governance of data within the platform.
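To make the "Managing Data Quality" activity above concrete, a rule is typically something like "at most N% NULLs in a column." A minimal, generic sketch of evaluating such a rule against rows (the rule shape here is illustrative only; Dataplex's own data quality scans declare rules in their own specification rather than in Python, and the column and values are hypothetical):

```python
# Sketch: evaluate a simple "null ratio" data-quality rule against rows.
# This is a generic illustration of the idea, not Dataplex's actual
# rule format; rows and column names are made up for the example.

def null_ratio(rows, column):
    """Fraction of rows where `column` is NULL/None."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows)

def passes_rule(rows, column, max_null_ratio):
    """True if the column's NULL ratio is within the allowed threshold."""
    return null_ratio(rows, column) <= max_null_ratio

rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
    {"customer_id": 4, "email": "d@example.com"},
]
print(null_ratio(rows, "email"))         # 0.25
print(passes_rule(rows, "email", 0.10))  # False: too many NULLs
```

In a Dataplex deployment, checks like this run as managed scans over the assets, and the results (rather than the data) are what surface through the zone's metadata.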

ms4446 - Thanks for all your time on this and the detailed replies. That really helps us understand the capabilities and applicability of Dataplex!