
How to share Google Dataplex/Data Catalog metadata outside the organization

We are sharing some BQ datasets with a third party (outside our org) via Analytics Hub, so the subscriber can create a linked dataset in their own project and any queries they run are billed to that project.

Now, we are exploring enriching the metadata for some of these datasets that are being shared using Dataplex. Are there any use cases/best practices on how this metadata can be shared? Two options come to mind:

  • Option 1: Enrich the metadata in Dataplex/Data Catalog and push it to a BQ dataset, then share that dataset via Analytics Hub. This would be the cleanest solution, but I am not clear on the following points:
    • Is it possible to push Dataplex/Data Catalog metadata to BQ?
    • It seems possible to create two types of metadata for a BQ table in Dataplex: the Dataplex entity and the Data Catalog entry. Can both, either, or neither be pushed? Link about question related to Dataplex vs. Data Catalog metadata: Link
  • Option 2: Give a user from the third party permission to view the metadata in Dataplex. Not sure if this is possible, i.e. having someone outside the org access only the metadata for certain datasets in Dataplex. Maybe organize the data into lakes and then grant access at the lake level, but in that case we would not be able to share the Data Catalog metadata. This would in any case not be the preferred approach, since it gives a third-party user direct access to resources inside the project.
2 ACCEPTED SOLUTIONS

Dataplex:

Dataplex is designed to manage data across various storage media, organizing it into lakes and zones for structured access. It is not primarily a metadata-management tool, but it does maintain some metadata about these structures.

Data Catalog:

Data Catalog acts as a centralized metadata repository that enables search and discovery across various data assets in Google Cloud. It does not manage data directly but rather the metadata that describes data assets, such as those in BigQuery.

Analytics Hub:

Primarily used for sharing datasets, Analytics Hub does not directly handle the sharing of raw metadata stored in Dataplex or Data Catalog without converting this metadata into a structured dataset first.

Recommended Approach: A Hybrid Solution

Curate Essential Metadata:

Identify the most valuable metadata elements to share with the third party, which might include:

  • Technical Metadata: Column names, data types, descriptions.
  • Business Metadata: Ownership details, classifications, and tagging.
  • Data Lineage: Details concerning data origins and transformations, although Data Catalog's capabilities here are limited.
  • Dataplex-Specific Metadata: Information about lakes and zones if relevant.

Structured Metadata Export:

  • Dataplex: Use custom scripts or processes (potentially leveraging Google Cloud services like Dataflow) to programmatically extract metadata from Dataplex, as there is no direct API for exporting metadata for user consumption.
  • Data Catalog: Utilize the Data Catalog API to systematically export metadata associated with BigQuery datasets, likely requiring transformation into a BigQuery-friendly format.
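The Data Catalog step above can be sketched as follows. This is a minimal illustration, assuming the `google-cloud-datacatalog` client library; the project/dataset/table names are placeholders, and the API call is wrapped in a function so it only runs where credentials are available.

```python
# Sketch: look up the Data Catalog entry for a BigQuery table and flatten
# its schema into rows suitable for loading into a metadata dataset.

def flatten_schema(entry_name, columns):
    """Turn a list of column dicts into flat metadata rows."""
    return [
        {
            "entity_name": entry_name,
            "column_name": col["name"],
            "data_type": col["type"],
            "description": col.get("description", ""),
        }
        for col in columns
    ]

def export_entry_metadata(project, dataset, table):
    """Fetch the entry for a BigQuery table via Data Catalog
    (requires google-cloud-datacatalog and credentials; not run here)."""
    from google.cloud import datacatalog_v1  # pip install google-cloud-datacatalog

    client = datacatalog_v1.DataCatalogClient()
    resource = (
        f"//bigquery.googleapis.com/projects/{project}"
        f"/datasets/{dataset}/tables/{table}"
    )
    entry = client.lookup_entry(request={"linked_resource": resource})
    columns = [
        {"name": c.column, "type": c.type_, "description": c.description}
        for c in entry.schema.columns
    ]
    return flatten_schema(entry.name, columns)
```

The flattened rows can then be loaded into BigQuery with any standard ingestion method (e.g. `load_table_from_json`).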

Create a Metadata Dataset:

  • Prepare a BigQuery dataset specifically designed for sharing, which may include separate tables for different metadata types (technical, business, lineage).
  • Use clear and descriptive naming conventions for easy understanding and consumption by third parties.
  • If sharing Dataplex entities, include a mapping table correlating them with their respective BigQuery datasets.
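As a concrete starting point, the metadata table described above could be created with DDL generated like this. The table and column names are illustrative, not a fixed convention:

```python
# Sketch: generate BigQuery DDL for a shared technical-metadata table.
# Column names mirror the example structure discussed in this thread.

METADATA_COLUMNS = [
    ("entity_type", "STRING", "Lake, zone, or table"),
    ("entity_name", "STRING", "Name of the Dataplex/BigQuery entity"),
    ("column_name", "STRING", "Column within the entity, if applicable"),
    ("data_type", "STRING", "BigQuery data type"),
    ("description", "STRING", "Business description of the column"),
    ("pii_flag", "BOOL", "Whether the column contains PII"),
]

def build_ddl(dataset, table):
    """Build a CREATE TABLE statement with per-column descriptions."""
    cols = ",\n  ".join(
        f"{name} {typ} OPTIONS(description='{desc}')"
        for name, typ, desc in METADATA_COLUMNS
    )
    return f"CREATE TABLE IF NOT EXISTS `{dataset}.{table}` (\n  {cols}\n);"

print(build_ddl("shared_metadata", "technical_metadata"))
```

Separate tables for business metadata and lineage would follow the same pattern with their own column sets.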

Share Metadata via Analytics Hub:

  • Publish your well-structured metadata dataset on Analytics Hub, allowing third parties to access it as they would any shared dataset.
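Publishing can be done from the Analytics Hub UI, or programmatically. The sketch below assumes the `google-cloud-bigquery-analyticshub` client library and an existing data exchange; all names are placeholders, and the API call sits in a function that is not executed here:

```python
# Sketch: publish the metadata dataset as an Analytics Hub listing on an
# existing data exchange (requires credentials; not run here).

def exchange_path(project, location, exchange):
    """Resource name of an Analytics Hub data exchange."""
    return f"projects/{project}/locations/{location}/dataExchanges/{exchange}"

def dataset_path(project, dataset):
    """Resource name of the BigQuery dataset to list."""
    return f"projects/{project}/datasets/{dataset}"

def publish_metadata_listing(project, location, exchange, dataset):
    from google.cloud import bigquery_analyticshub_v1

    client = bigquery_analyticshub_v1.AnalyticsHubServiceClient()
    listing = bigquery_analyticshub_v1.Listing(
        display_name="Shared dataset metadata",
        bigquery_dataset=bigquery_analyticshub_v1.Listing.BigQueryDatasetSource(
            dataset=dataset_path(project, dataset)
        ),
    )
    return client.create_listing(
        parent=exchange_path(project, location, exchange),
        listing=listing,
        listing_id="metadata_listing",  # hypothetical listing ID
    )
```

Subscribers then link the metadata dataset exactly as they do the data datasets you already share.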

Additional Considerations:

  • Metadata Security: Review all metadata for sensitive information. Utilize obfuscation or anonymization for sensitive elements and employ Data Catalog’s IAM controls to enhance security.
  • Metadata Updates: Implement a regular process to update the exported metadata in BigQuery, ensuring that changes in source systems are reflected timely.
  • Metadata Documentation: Provide comprehensive documentation explaining the structure, meaning, and special considerations of the metadata to ensure it is usable and understandable.

Example (Illustrative):

Imagine a Dataplex setup with a lake named "customer_data" and a table "customer_transactions". You intend to share this metadata:

  • Dataplex: Lake name, table name.
  • Data Catalog: Column names, data types, descriptions.
  • Custom Metadata: A "PII Flag" column indicating the presence of Personally Identifiable Information (PII).

Your metadata dataset structure could look like this:

| Entity Type | Entity Name | Column Name | Data Type | Description | PII Flag |
| --- | --- | --- | --- | --- | --- |
| Lake | customer_data | customer_id | STRING | Customer's unique ID | Yes |
| Table | customer_transactions | transaction_amount | FLOAT64 | Transaction amount in USD | No |


This hybrid solution leverages the strengths of Dataplex, Data Catalog, and Analytics Hub. It gives full control over which metadata elements are shared and provides a structured, easily consumable method for third parties to access your metadata.


Sorry for the confusion. The "Metadata" tab mentioned in the older Google Cloud Community post is inaccurate. There is no reference to a "Metadata" tab in the current Dataplex documentation, and many users, including yourself, have reported being unable to locate it.

As of now, the recommended approach for pushing Dataplex metadata to BigQuery involves using the Dataplex REST API to programmatically export metadata and then load it into BigQuery using your preferred method (e.g., Python scripts, Cloud Functions).
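As a rough illustration of that approach, the sketch below assumes the `google-cloud-dataplex` client library; the project, lake, and zone names are placeholders, and the API call is wrapped in a function so it only runs where credentials are available:

```python
# Sketch: list Dataplex entities in a zone and shape them into rows that
# can be loaded into BigQuery (e.g. via load_table_from_json).

def entities_to_rows(entities):
    """Convert (id, type, asset) tuples into flat metadata rows."""
    return [
        {"entity_type": etype, "entity_name": eid, "source_asset": asset}
        for eid, etype, asset in entities
    ]

def export_zone_entities(project, location, lake, zone):
    """List table entities in a Dataplex zone (requires credentials)."""
    from google.cloud import dataplex_v1  # pip install google-cloud-dataplex

    client = dataplex_v1.MetadataServiceClient()
    parent = (
        f"projects/{project}/locations/{location}"
        f"/lakes/{lake}/zones/{zone}"
    )
    request = dataplex_v1.ListEntitiesRequest(
        parent=parent,
        view=dataplex_v1.ListEntitiesRequest.EntityView.TABLES,
    )
    entities = [
        (e.id, "Table", e.asset) for e in client.list_entities(request=request)
    ]
    return entities_to_rows(entities)
```

For column-level detail, a follow-up `get_entity` call with the full view would be needed per entity; the listing call returns only basic entity information.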

Alternative to Consider:

While the direct "Metadata" tab option seems unavailable, you might want to explore a Dataplex metadata export feature if one becomes available. Please note that, as of the latest documentation, Dataplex does not explicitly offer an automatic metadata export to Google Cloud Storage in Avro format for direct use; you would typically need to implement a custom solution for exporting metadata. Check the Dataplex documentation for the most current capabilities.

