
Dataplex Capabilities

Just kicking the tires on Dataplex to see if it is the right fit. We are interested in its governance capabilities for the future, but right now we want to see whether it offers any benefits for managing BigQuery transformations across multiple GCP projects. We were thinking Dataplex might let us simply point to data in other projects that we have been duplicating because the source projects are in different regions. Does it add any capabilities of its own for data connectivity, movement, or transformation, or does it just expose existing GCP functionality (e.g., pointing to existing Dataflow pipelines / Spark features)? Could you point to data in another project by adding it to a zone in Dataplex, and then create a new transformed view of that data in Dataplex?

Also, the Dataplex Secure tab allows you to apply permissions to a set of data assets that may span multiple projects, right?

ACCEPTED SOLUTION

Hi @steveh1,

In Dataplex, the BigQuery datasets created for each zone primarily serve as repositories for metadata management. That metadata includes details about the data assets within the zone, such as schemas, descriptions, and tags used for governance and cataloging. These datasets are not meant to store the actual data (the raw or transformed contents) but to hold the metadata that facilitates better data understanding and governance across your data lakes.
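
To your cross-project question: the mechanism for pointing at data in another project is asset registration. You attach the external BigQuery dataset (or Cloud Storage bucket) to a zone, and Dataplex catalogs it without copying or moving anything. Here is a minimal sketch, assuming the google-cloud-dataplex Python client; the project, lake, zone, and dataset names are all hypothetical:

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# Register a BigQuery dataset that lives in another project as an asset
# in an existing zone. This catalogs the dataset; it does not copy data.
asset = dataplex_v1.Asset(
    resource_spec=dataplex_v1.Asset.ResourceSpec(
        # Full resource name of the dataset to attach (hypothetical).
        name="projects/source-project/datasets/sales",
        type_=dataplex_v1.Asset.ResourceSpec.Type.BIGQUERY_DATASET,
    ),
)

operation = client.create_asset(
    parent="projects/governance-project/locations/us-central1/lakes/my-lake/zones/raw-zone",
    asset=asset,
    asset_id="sales-data",
)
operation.result()  # create_asset is long-running; wait for it to finish
```

Keep in mind that lakes and zones are regional, so location constraints still apply to which resources a zone can attach; attaching assets will not by itself remove cross-region duplication.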

If transformed data is being stored in a BigQuery dataset within a Dataplex setup, that typically involves processes set up outside of Dataplex's direct capabilities. For instance, you might use BigQuery's own transformation tools (SQL queries or BigQuery ML) or integrate with Dataflow or Dataproc for transformation jobs that store their outputs in BigQuery. Dataplex's role is more about orchestrating and managing these processes than about directly executing them.
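
For example, the "new transformed view" from your question would be created with ordinary BigQuery DDL; Dataplex would then discover and catalog it rather than run it. A minimal sketch using the google-cloud-bigquery Python client, with made-up project, dataset, and table names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="analytics-project")

# Define a view in this project that reads a table owned by another project.
# BigQuery queries cannot span locations, so the view and the source table
# must be in the same region (which is why cross-region data often gets copied).
view = bigquery.Table("analytics-project.curated.orders_daily")
view.view_query = """
    SELECT order_date, SUM(amount) AS total_amount
    FROM `source-project.sales.orders`
    GROUP BY order_date
"""
client.create_table(view, exists_ok=True)
```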

Data ends up in these BigQuery datasets through various means:

  • External Data Processing Tools: Tools like Dataflow, Dataproc, or external ETL tools can process data and load the results into BigQuery datasets.
  • Manual Processes: Data engineers or scientists might manually create or update datasets as part of their workflows.
  • Automated Workflows: Scheduled scripts or queries, as part of a broader data pipeline, might populate or update these datasets periodically (see the sketch after this list).
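
The automated-workflow case, for instance, is often just a scheduled load or query job whose destination happens to be a dataset attached to a zone. A minimal sketch with the google-cloud-bigquery client; the bucket and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="analytics-project")

# Append new Parquet files from a landing bucket into a zone-attached dataset.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://landing-bucket/orders/*.parquet",
    "analytics-project.curated.orders_raw",
    job_config=job_config,
)
load_job.result()  # block until the load finishes
```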

Curating data in the context of Dataplex involves a few key activities:

  • Cataloging Data Assets: Organizing and classifying data assets within lakes and zones for easier access and management. This includes assigning metadata, tags, and descriptions to improve discoverability and usability.
  • Managing Data Quality: Implementing checks and balances to ensure data integrity and consistency across different data assets. This might involve setting up data quality rules, validations, and monitoring to maintain high data standards.
  • Data Lineage and Tracking: Keeping track of where data comes from, how it is processed, and where it moves over time. This is crucial for compliance, troubleshooting, and optimizing data workflows.
  • Applying Governance Policies: Implementing and enforcing data governance policies that ensure data security, privacy, and compliance with regulatory requirements. This includes managing access permissions, audit logs, and compliance checks (see the sketch after this list).
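
On your Secure tab question: Dataplex's security model is built so that roles granted on a lake or zone propagate down to the attached assets, which can make it a single place to manage permissions across projects. A sketch of granting a reader role at the lake level, assuming the google-cloud-dataplex client and its standard IAM mixin methods; the resource names and group are hypothetical:

```python
from google.cloud import dataplex_v1
from google.iam.v1 import iam_policy_pb2

client = dataplex_v1.DataplexServiceClient()

# Grant a data-reader role at the lake level; Dataplex propagates it to the
# assets attached underneath, even when they live in other projects.
lake = "projects/governance-project/locations/us-central1/lakes/my-lake"

policy = client.get_iam_policy(
    request=iam_policy_pb2.GetIamPolicyRequest(resource=lake)
)
policy.bindings.add(
    role="roles/dataplex.dataReader",
    members=["group:analysts@example.com"],  # hypothetical group
)
client.set_iam_policy(
    request=iam_policy_pb2.SetIamPolicyRequest(resource=lake, policy=policy)
)
```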

The BigQuery datasets associated with Dataplex zones are more focused on handling metadata, with actual data processing and transformation being managed through other tools that integrate with Dataplex. Curating data involves organizing, enhancing, and ensuring the governance of data within the platform.

