Usage of Data Catalog

Hi,

I need some information on the points below. Please advise.

1. Can Data Catalog collect the actual data from Pub/Sub topics, or can it only collect the metadata of Pub/Sub topics?
2. Data Catalog is described as collecting two kinds of metadata: business metadata and technical metadata. Can you give some insight into both, with examples?

Thanks.

Data Catalog serves as a centralized metadata repository for a wide array of data assets, including Pub/Sub topics. For Pub/Sub, it catalogs descriptive details about the topics themselves, that is, metadata; it does not collect the message content, which is ephemeral.

Metadata Collected: Data Catalog captures the following metadata for Pub/Sub topics:

  • Topic Name: Serves as a unique identifier, simplifying topic reference.

  • Description: Offers a succinct overview of the topic's purpose.

  • Schema (if defined): Outlines the data structure for schema-enforced topics, aiding in data consistency and comprehension.

  • Creation and Update Timestamps: Chronicles the inception and modification dates, providing lifecycle insights.

  • Associated Labels: Employs key-value pairs to streamline topic organization and retrieval.
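To make the list above concrete, here is a minimal Python sketch. The `TopicMetadata` class, its field names, and the project and topic IDs are all hypothetical and only mirror the bullets; this is not the Data Catalog API. The `//pubsub.googleapis.com/projects/.../topics/...` string is the standard full resource name for a topic. To my understanding it is also the kind of string Data Catalog's lookup call takes as a linked resource, but please verify that against the client library documentation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def pubsub_linked_resource(project_id: str, topic_id: str) -> str:
    """Build the full resource name that identifies a Pub/Sub topic."""
    return f"//pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"


@dataclass
class TopicMetadata:
    """Hypothetical container mirroring the fields listed above.

    There is deliberately no field for message payloads: only
    descriptive details about the topic are modeled.
    """
    name: str
    description: str = ""
    schema: Optional[str] = None        # only set for schema-enforced topics
    created: Optional[datetime] = None
    updated: Optional[datetime] = None
    labels: dict = field(default_factory=dict)


# Placeholder project and topic IDs, not real resources.
resource = pubsub_linked_resource("my-project", "orders-events")
meta = TopicMetadata(
    name=resource,
    description="Order lifecycle events",
    created=datetime(2024, 1, 15, tzinfo=timezone.utc),
    labels={"team": "checkout"},
)
print(meta.name)
```

The point of the sketch is the shape of the record: everything in it describes the topic, and nothing in it holds the messages flowing through the topic.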

Technical vs. Business Metadata

Data Catalog distinguishes between technical and business metadata, which serve different roles within an organization.

Technical Metadata:

  • Focus: Targets the data's structural, format, and technical details.

  • Examples:

    • Column Names and Data Types: Clarifies database table attributes and formats.

    • File Types: Identifies data formats, e.g., CSV, JSON, Parquet.

    • Data Sizes: Measures data volume, enhancing resource planning.

    • Storage Locations: Specifies data storage sites, facilitating access.

    • Data Lineage: Maps the data's journey, ensuring transparency and aiding in error tracing.

Business Metadata:

  • Focus: Adds business context, supplied by users, so that the data is intelligible to non-technical stakeholders.

  • Examples:

    • Data Descriptions: Demystifies data sets, e.g., "Customer Transactions."

    • Business Terms: Harmonizes data terminology with organizational lingo, e.g., "Net Revenue."

    • Data Ownership: Assigns responsibility, fostering accountability.

    • Sensitivity Classifications: Marks data sensitivity, supporting compliance.

    • Usage Guidelines: Directs data utilization, promoting best practices.
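As a rough illustration of how the two kinds of metadata sit side by side on one asset, here is a small Python sketch. The `CatalogEntry` class, the bucket path, the email address, and all field values are hypothetical examples built from the bullets above; they are not taken from the Data Catalog API.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one data asset."""
    # Technical metadata: describes structure and storage.
    columns: dict            # column name -> data type
    file_format: str
    size_bytes: int
    location: str
    # Business metadata: context supplied by people.
    business: dict = field(default_factory=dict)


entry = CatalogEntry(
    columns={"txn_id": "STRING", "amount": "NUMERIC", "txn_ts": "TIMESTAMP"},
    file_format="PARQUET",
    size_bytes=52_428_800,
    location="gs://example-bucket/transactions/",
)

# Business context is added afterwards by a data owner or steward.
entry.business.update({
    "description": "Customer Transactions",
    "business_term": "Net Revenue",
    "owner": "finance-data@example.com",
    "sensitivity": "CONFIDENTIAL",
    "usage_guidelines": "Aggregate before sharing outside Finance.",
})


def summarize(e: CatalogEntry) -> str:
    """Combine both kinds of metadata into one human-readable line."""
    cols = ", ".join(f"{name}:{dtype}" for name, dtype in e.columns.items())
    label = e.business.get("description", "(undescribed)")
    return f"{label} [{e.file_format}] {cols}"


print(summarize(entry))
```

The technical fields could be filled in by a crawler without any human input, while the `business` dictionary stays empty until someone who knows the data writes it down.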

Importance of the Distinction

  • Technical Metadata: Indispensable for data professionals to navigate, comprehend, and manipulate data efficiently.

  • Business Metadata: Paramount for business users to grasp data's business relevance, enabling informed decision-making and strategic insights.

Enhancing Governance, Collaboration, and Efficiency

The Data Catalog not only facilitates robust data management but also significantly contributes to data governance and compliance efforts. By leveraging business metadata, organizations can meticulously classify data sensitivity and establish clear usage guidelines, ensuring adherence to regulatory standards and internal policies.

Moreover, the Data Catalog fosters collaboration across teams by providing a unified framework and language for data assets. This shared understanding accelerates project onboarding, enhances cross-functional teamwork, and streamlines data-driven decision-making processes.

Practical Applications and Integration

Implementing the Data Catalog can address specific organizational needs, such as:

  • Error Tracing: Utilizing data lineage to pinpoint the origins of discrepancies in reporting.

  • Onboarding Efficiency: Leveraging business metadata to quickly acclimate new employees to the organizational data landscape.
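The error-tracing use case can be pictured as walking a lineage graph upstream from the report that looks wrong. The sketch below is plain Python over a hand-written dictionary; the asset names and the `trace_upstream` helper are hypothetical, and it does not call any lineage API.

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets it is derived from.
UPSTREAM = {
    "revenue_report": ["daily_revenue"],
    "daily_revenue": ["orders_clean", "refunds_clean"],
    "orders_clean": ["orders_raw"],
    "refunds_clean": ["refunds_raw"],
    "orders_raw": [],
    "refunds_raw": [],
}


def trace_upstream(asset: str, graph: dict) -> list:
    """Return every upstream asset, nearest first (breadth-first search)."""
    seen = {asset}
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


print(trace_upstream("revenue_report", UPSTREAM))
```

Listing the nearest ancestors first gives a natural order for checking where a discrepancy was introduced.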

The integration process with existing Google Cloud services is straightforward, ensuring that organizations can seamlessly adopt the Data Catalog without disrupting their current workflows. This compatibility underscores the practicality and immediate value of incorporating the Data Catalog into an organization's data management ecosystem.

Hi, 

One question about the point below; it may be a basic one. Please clarify.

Column Names and Data Types: Clarifies database table attributes and formats.
How does Data Catalog collect this from a Pub/Sub topic? Does the Pub/Sub topic send the data to BigQuery or Cloud Storage, with Data Catalog then collecting the metadata from there?

Thanks.

Detailed metadata such as column names and data types is collected from Pub/Sub topics only indirectly: the data is first ingested into a structured storage or processing service (like BigQuery) where a schema can be applied. Data Catalog then collects the metadata from these services, not directly from the Pub/Sub topics. This integration across Google Cloud services enables a comprehensive approach to data management and metadata cataloging, enhancing discoverability and governance of data assets.
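A minimal sketch of that flow, assuming JSON message payloads and a BigQuery table as the landing place. The schema, the project, dataset and table IDs, and the helper names are hypothetical; nothing here calls Pub/Sub, BigQuery, or Data Catalog. It only shows that the column names and types are read from the table schema, while the messages merely supply row values.

```python
import json

# Hypothetical schema of the BigQuery table the topic's messages land in.
TABLE_SCHEMA = [
    {"name": "order_id", "type": "STRING"},
    {"name": "amount", "type": "FLOAT"},
    {"name": "created_at", "type": "TIMESTAMP"},
]


def bigquery_linked_resource(project: str, dataset: str, table: str) -> str:
    """Build the full resource name that identifies a BigQuery table."""
    return (
        f"//bigquery.googleapis.com/projects/{project}"
        f"/datasets/{dataset}/tables/{table}"
    )


def column_metadata(schema: list) -> list:
    """Column names and types come from the table schema, not from messages."""
    return [(col["name"], col["type"]) for col in schema]


def to_row(message_data: bytes, schema: list) -> dict:
    """Shape one JSON message payload into a row with the schema's columns."""
    payload = json.loads(message_data)
    return {col["name"]: payload.get(col["name"]) for col in schema}


msg = b'{"order_id": "A-1", "amount": 19.5, "created_at": "2024-01-15T10:00:00Z"}'
row = to_row(msg, TABLE_SCHEMA)

print(column_metadata(TABLE_SCHEMA))
print(bigquery_linked_resource("my-project", "sales", "orders"))
```

Note that `column_metadata` never looks at `msg`: the catalogable structure exists only once a schema has been declared on the destination.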

Hi @ms4446,

I had a query regarding technical metadata vs. business metadata. In the documentation it is mentioned that "Data Catalog handles two types of metadata: technical metadata and business metadata." Some examples of each are then given, as you also mentioned. It is not stated anywhere, but I assume that technical metadata is read-only whereas business metadata is what can be added. Is my understanding correct? If not, can you provide some examples of what technical metadata can be added, and where?

Your assumption about the nature of technical metadata versus business metadata in the context of Google Cloud's Data Catalog is partially correct, but there are some nuances worth clarifying.

Technical Metadata:

  • This typically includes details that are automatically extracted and cataloged by the Data Catalog from data sources. Examples include schema information (column names, data types), file formats (CSV, JSON, Parquet), and other system-level details like data sizes and storage locations.
  • Technical metadata is generally considered "read-only" in the sense that it is derived directly from the data systems themselves (e.g., database schemas in BigQuery) and reflects the underlying structure and technical properties of the data assets.

Business Metadata:

  • Business metadata involves contextual information that adds interpretative value to the data, making it more understandable and useful for business users. This might include descriptions of the data set, business terms, data ownership details, and usage guidelines.
  • Unlike technical metadata, business metadata often requires manual entry or can be semi-automated through integrations with business processes that capture this context. It is generally customizable and editable, allowing organizations to tailor the metadata to suit their operational and governance needs.
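One way to picture the read-only versus editable distinction is the Python sketch below. The two classes are hypothetical models, not Data Catalog types; the frozen dataclass simply stands in for "derived from the source system, so not edited in the catalog".

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class TechnicalMetadata:
    """Derived from the source system, so treated as read-only here."""
    column_types: tuple      # (column name, data type) pairs
    file_format: str


@dataclass
class BusinessMetadata:
    """Entered by people, so editable."""
    description: str = ""
    owner: str = ""


tech = TechnicalMetadata(column_types=(("user_id", "INT64"),), file_format="CSV")
biz = BusinessMetadata()

biz.description = "Registered users"     # allowed: business context is editable

try:
    tech.file_format = "JSON"            # rejected: the dataclass is frozen
    edit_rejected = False
except FrozenInstanceError:
    edit_rejected = True

print(edit_rejected)
```

In this model, changing the technical side means changing the source system and letting the catalog pick up the new values, rather than editing the catalog record directly.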

 

While it's true that most technical metadata is automatically extracted, there are scenarios where technical metadata might be manually added or customized:

  • Custom Schema Definitions: When you integrate data into systems like BigQuery, you often define custom schemas. This schema creation, while it aligns with the technical structure of the data, involves specifying column names, types, and descriptions, which can be considered as adding technical metadata.
  • Data Transformations: During data processing (e.g., in Dataflow or when using Apache Beam scripts), you might transform data formats or create new data structures. Here, you are effectively creating new technical metadata by defining how data should be structured and stored post-transformation.
  • Metadata Enrichment: In some cases, technical metadata can be enriched with additional details to improve its utility or compliance with specific standards. This might involve adding indexing information, annotations about data sensitivity (if considered a technical aspect), or enhanced details on data lineage beyond what is automatically captured.
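For the custom schema case, here is a short sketch of what "adding technical metadata" can look like when you author a table schema yourself. The `name`, `type`, `mode` and `description` keys follow the BigQuery JSON schema file format; the columns themselves and the `undocumented` helper are hypothetical.

```python
import json

# Hypothetical columns for a table whose schema you define yourself.
schema = [
    {"name": "customer_id", "type": "STRING", "mode": "REQUIRED",
     "description": "Stable identifier assigned at signup"},
    {"name": "signup_ts", "type": "TIMESTAMP", "mode": "REQUIRED",
     "description": "When the account was created (UTC)"},
    {"name": "referrer", "type": "STRING", "mode": "NULLABLE",
     "description": ""},
]


def undocumented(columns: list) -> list:
    """Names of columns whose description is missing or empty."""
    return [col["name"] for col in columns if not col.get("description")]


# JSON text in the shape of a schema file.
schema_json = json.dumps(schema, indent=2)

print(undocumented(schema))
```

The column names, types and modes are technical metadata that you chose, and the per-column descriptions are where structure and business context start to overlap.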

Understanding the distinction and the flexibility in managing these types of metadata is crucial for effective data governance and usability. While technical metadata provides the foundational structure and understanding of the data environment, business metadata bridges the gap between this data and its practical business applications, enhancing user engagement and data literacy across the organization.

By leveraging the capabilities of the Data Catalog to manage both types of metadata, organizations can ensure that their data assets are not only technically robust and compliant but also aligned with business needs and easily navigable by end-users.