Dataplex Data catalog search interface.

We are looking to build a search interface on top of the Dataplex metadata. Dataplex Data Catalog ingests all the metadata from BigQuery assets and can take additional metadata through tag templates, business glossary, etc., so I would like a search interface similar to Data Catalog's. I searched the Data Catalog APIs and found the two endpoints below. Suppose I search with a tag template name or business term name that is attached to a table: I need to get back the dataset name with its description. I do not need to see the data; the dataset name and the full description are enough. How can I achieve this?

https://datacatalog.googleapis.com/v1beta1/catalog:search

https://datacatalog.googleapis.com/v1beta1/entries:lookup

 


Below is a strategy that combines the power of Google Cloud Data Catalog and its APIs to achieve what you're looking for:

  • Dataplex Metadata in Data Catalog: Think of your Dataplex metadata—tags, business glossary terms, etc.—as entries within Data Catalog. This lets you use Data Catalog's robust search features.
  • Tag Templates and Business Glossary: If you're already using these in Dataplex, you're ahead of the game! This structured metadata is perfect for indexing in Data Catalog.
  • API Combination: We'll use two main Data Catalog APIs:
    • catalog:search: Finds entries matching your search (e.g., tag name, business term).
    • entries:lookup: Gets the full details of each found entry, including dataset name and description.

Implementation Steps

  1. Entry Creation: If your Dataplex metadata isn't in Data Catalog yet:
    • Manual: Create entries directly in Data Catalog, linking them to your BigQuery resources.
    • Automated: Write a script to sync your Dataplex metadata with Data Catalog entries regularly.
  2. Search Interface:
    • Build a user interface (web page, etc.) where users enter their search queries.
    • When a query is submitted:
      • Use catalog:search to find matching entries.
      • For each entry, use entries:lookup to get the full details.
      • Extract and display the dataset name and description in your interface.

Here's a Python code snippet demonstrating how to use the APIs to perform the search and retrieve the dataset details:

 
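from google.cloud import datacatalog_v1beta1

def search_dataplex_metadata(query, project_id="your-project-id"):
    """Print the linked resource and description of each entry matching the query."""
    client = datacatalog_v1beta1.DataCatalogClient()

    # Restrict the search to the projects you want to cover.
    scope = datacatalog_v1beta1.types.SearchCatalogRequest.Scope(
        include_project_ids=[project_id]
    )
    request = datacatalog_v1beta1.types.SearchCatalogRequest(
        query=query,
        scope=scope
    )

    for result in client.search_catalog(request=request):
        # Some results (for example tag templates) have no linked resource to look up.
        if not result.linked_resource:
            continue
        # lookup_entry takes a request, not a linked_resource keyword argument.
        entry = client.lookup_entry(
            request={"linked_resource": result.linked_resource}
        )
        # entry.name is the Data Catalog entry path; linked_resource identifies
        # the BigQuery dataset or table itself.
        print(f"Resource: {entry.linked_resource}, Description: {entry.description}")

# Example usage
search_dataplex_metadata("your_tag_template_name OR your_business_term")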
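Since the question asks for the dataset name specifically, one option is to derive it from the `linked_resource` string that the search results carry. For BigQuery assets that string has the form `//bigquery.googleapis.com/projects/<project>/datasets/<dataset>/tables/<table>`. The helper below is only a sketch: `parse_bigquery_resource` is a hypothetical name, not part of the Data Catalog client, and it assumes you only care about BigQuery datasets and tables.

```python
import re
from typing import Optional, Tuple

# Pattern for BigQuery linked resources; the table segment is optional so that
# dataset-level entries are handled too.
_BQ_RESOURCE = re.compile(
    r"^//bigquery\.googleapis\.com/projects/(?P<project>[^/]+)"
    r"/datasets/(?P<dataset>[^/]+)(?:/tables/(?P<table>[^/]+))?$"
)


def parse_bigquery_resource(
    linked_resource: str,
) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (project, dataset, table) or None if this is not a BigQuery resource.

    table is None when the resource points at a dataset rather than a table.
    """
    match = _BQ_RESOURCE.match(linked_resource)
    if match is None:
        return None
    return match.group("project"), match.group("dataset"), match.group("table")
```

In the search loop you could call this on `result.linked_resource` and display the second element as the dataset name next to the entry description.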

Important Considerations

  • Permissions: Make sure the service account accessing the Data Catalog APIs has the right permissions.
  • Search Syntax: Learn Data Catalog's search syntax for effective queries.
  • Custom Attributes: Consider using these in Data Catalog to store extra Dataplex-specific metadata.
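On the search syntax point, a small helper can keep query construction in one place. The sketch below assumes the `tag:` qualifier and the `OR` operator from the Data Catalog search syntax reference; please verify both against the current documentation before relying on them. `build_search_query` is a hypothetical helper name, and plain terms are passed through as free text.

```python
from typing import Iterable


def build_search_query(
    tag_templates: Iterable[str] = (), terms: Iterable[str] = ()
) -> str:
    """Combine tag template names and free-text terms into one OR query."""
    # Tag template names are prefixed with the tag: qualifier (assumed syntax).
    clauses = [f"tag:{name.strip()}" for name in tag_templates if name.strip()]
    # Business terms are searched as plain free text.
    clauses += [term.strip() for term in terms if term.strip()]
    if not clauses:
        raise ValueError("at least one tag template or term is required")
    return " OR ".join(clauses)
```

For example, `build_search_query(["pii_template"], ["customer"])` produces a string that can be passed as the `query` of a `catalog:search` request.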

I know that in Python you can get assets, entities, and datasets fairly easily. I am going to have to build a UI on top of Dataplex to handle automation, and possibly even replace Data Discovery, because we are having issues with schemas not being deleted along with their tables and rebuilt tables getting lost in no man's land. https://cloud.google.com/data-catalog/docs/concepts/metadata

Building a UI on top of Dataplex for managing and automating metadata tasks, especially to address issues with schema synchronization and data discovery, is a great idea. Here’s a plan for how to accomplish this, covering assets, entities, and datasets and using the Data Catalog API for metadata management:

  1. Authentication and Setup:

    • Ensure that your application can authenticate with Google Cloud services using service accounts with appropriate permissions.

    • Enable necessary APIs, including Data Catalog API and Dataplex API.

  2. Fetching Metadata:

    • Use the Dataplex API to fetch assets and entities, and the Data Catalog API to search for dataset and table entries.

    • Implement functions to search and retrieve metadata entries.

  3. Handling Metadata Updates:

    • Automate the synchronization of metadata to ensure schemas are correctly handled when tables are deleted or rebuilt.

    • Use hooks or triggers in your data pipeline to update Data Catalog when changes occur.

  4. Building the UI:

    • Develop a user interface that allows users to search, view, and manage metadata.

    • Integrate Data Catalog API to allow users to perform searches and view detailed information about datasets, assets, and entities.

  5. Automation and Maintenance:

    • Implement scripts or background jobs that periodically check for inconsistencies in metadata and update accordingly.

    • Provide tools within the UI for users to manually trigger metadata synchronization or corrections.
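The periodic consistency check in step 5 can be reduced to a set comparison once you have two lists of table identifiers: one from the catalog and one from the live source. The sketch below assumes you have already collected those identifiers (for example as `project.dataset.table` strings); `SyncReport` and `diff_metadata` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class SyncReport:
    # Entries still in the catalog whose table no longer exists.
    stale_in_catalog: Set[str]
    # Live tables that have no catalog entry (e.g. rebuilt tables).
    missing_from_catalog: Set[str]


def diff_metadata(catalog_tables: Iterable[str], live_tables: Iterable[str]) -> SyncReport:
    """Compare catalog entries with live tables and report both kinds of drift."""
    catalog = set(catalog_tables)
    live = set(live_tables)
    return SyncReport(
        stale_in_catalog=catalog - live,
        missing_from_catalog=live - catalog,
    )
```

A background job could run this on a schedule and surface the two sets in the UI, leaving the actual cleanup or re-registration to a manual or automated follow-up step.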

Some Key Considerations

  • Error Handling: Implement robust error handling to manage API failures or inconsistencies in metadata.
  • Scalability: Ensure that your application can scale with the number of assets and metadata entries.
  • User Access Control: Implement appropriate access control mechanisms to ensure only authorized users can make changes to the metadata.
  • Monitoring: Set up monitoring and logging to track the synchronization process and identify any issues promptly.
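For the error-handling point, a simple retry with exponential backoff is often enough for transient API failures. The sketch below is generic Python and not tied to any Google library; `call_with_retry` and its parameters are hypothetical. The Google client library methods also accept their own `retry` argument, which may be a better fit in practice.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[Exception], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            # Re-raise once the last allowed attempt has failed.
            if attempt == attempts - 1:
                raise
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

You would typically narrow `retry_on` to the transient error types you actually expect, so that permission or validation errors fail immediately instead of being retried.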

 

Hi @ms4446 ,

Thank you so much for the detailed plan for the UI. I am trying to build a UI and am in the initial stages, and I have a couple of questions.
Authentication and Setup: What options do I have? Can a valid user with GCP credentials log in? Can I integrate with IAM or a simple authentication service? Is there a GCP-managed service for this?
UI: Is there a simple UI framework you can suggest? I am more of a backend guy and not a UI person 😊