Just kicking the tires on Dataplex to see if it is the right fit. We are interested in its governance capabilities for the future, but right now we want to see whether it offers any benefits for managing BigQuery transformations across multiple GCP projects. We were thinking that Dataplex might let us point to some of the data in other projects that we have been duplicating because the source projects are in different regions. Does it add any capabilities for data connectivity, movement, or transformation, or does it just expose existing GCP functionality (e.g., pointing to other Dataflow pipeline / Spark features)? Could you point to data in another project by adding it to a zone in Dataplex, and then create a new transformed view of the data in Dataplex?
Also, the Dataplex Secure tab allows you to apply permissions onto a set of data assets that may span multiple projects, right?
Hi @steveh1,
In Dataplex, the BigQuery datasets created for each zone primarily serve as repositories for metadata. This metadata might include details about the data assets within the zone, such as schemas and descriptions, used for governance and cataloging. The primary purpose of these datasets is not to store the actual data (the raw or transformed contents) but to manage the metadata that makes data across your data lakes easier to understand and govern.
If transformed data is being stored in a BigQuery dataset within a Dataplex setup, that data is typically produced by processes set up outside the direct capabilities of Dataplex. For instance, you might use BigQuery's own transformation tools (such as SQL queries or BigQuery ML), or integrate with Dataflow or Dataproc for transformation jobs that then store their outputs in BigQuery. Dataplex's role is more about orchestrating and managing these processes than directly executing them.
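As a rough illustration of the "transform in BigQuery itself" route, here is a minimal sketch that builds the DDL for a view in one project reading a table in another. All project, dataset, table, and column names are hypothetical, and the helper `build_view_ddl` is just for illustration, not part of any Google library. One caveat to keep in mind: BigQuery requires the dataset holding the view and the dataset holding the referenced table to be in the same location, so this alone would not remove the need to copy data across regions.

```python
def build_view_ddl(view_project, view_dataset, view_name,
                   src_project, src_dataset, src_table,
                   select_expr="*"):
    """Return a CREATE OR REPLACE VIEW statement whose source table
    lives in a different project than the view (hypothetical helper)."""
    # Fully qualified names take the form `project.dataset.object`.
    view_id = f"`{view_project}.{view_dataset}.{view_name}`"
    source_id = f"`{src_project}.{src_dataset}.{src_table}`"
    return (
        f"CREATE OR REPLACE VIEW {view_id} AS\n"
        f"SELECT {select_expr}\n"
        f"FROM {source_id}"
    )


# Hypothetical names: a raw table in "source-proj" exposed as a
# lightly transformed view in "analytics-proj".
ddl = build_view_ddl(
    "analytics-proj", "curated", "orders_clean",
    "source-proj", "raw", "orders",
    select_expr="order_id, UPPER(status) AS status",
)
print(ddl)
```

The resulting statement could then be run in the BigQuery console or submitted through a client library, assuming the caller has read access to the source table and permission to create views in the target dataset.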
Data ends up in these BigQuery datasets through various means:
Curating data in the context of Dataplex involves a few key activities:
The BigQuery datasets associated with Dataplex zones are more focused on handling metadata, with actual data processing and transformation being managed through other tools that integrate with Dataplex. Curating data involves organizing, enhancing, and ensuring the governance of data within the platform.