Solved: Re: How to identify untagged data sets

Beginner · 02-12-2024 06:30 AM

Hi,

I intend to use Dataplex to tag both BigQuery (BQ) datasets and Cloud Storage (CS) buckets in order to identify data that contains PII and credit card info.

Whenever a new BQ Dataset or CS Bucket is created, is there any way to search for these in Dataplex, so that a tag can be applied to these? In other words, is there a way for Dataplex to highlight any datasets or buckets that do not have any tags applied?

Thanks

ms4446

Dataplex provides tools for managing, discovering, and governing data across BigQuery datasets and GCS buckets. Tagging and classifying data are key steps in identifying and managing sensitive information such as PII and credit card details. Here is an overview of how to use Dataplex for this:

Data Discovery and Classification:

Automated Discovery and Profiling: Leveraging Data Catalog and DLP (Data Loss Prevention), Dataplex automates the discovery and profiling of data assets. This foundational step is critical for identifying metadata and sensitive data, ensuring a thorough classification approach.
Custom Data Classifiers: When specific detection requirements for PII or credit card formats exceed the capabilities of built-in classifiers, custom data classifiers become indispensable. These classifiers, crafted through regular expressions or machine learning models, provide a tailored approach to sensitive data identification.
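To make the regular-expression idea concrete, here is a minimal standalone sketch of a credit card classifier. It is not how DLP's built-in detectors work; the pattern, the function names, and the choice to confirm candidates with a Luhn checksum are all assumptions made for illustration.

```python
import re
from typing import List

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
# This pattern is an illustrative assumption, not a DLP detector definition.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return digit-only card number candidates in text that pass the Luhn check."""
    found = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found
```

The checksum step is there to cut false positives from order numbers and other long digit runs; a regex alone would flag all of them.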

Filtering and Highlighting Untagged Assets:

Dataplex's search filters can help narrow down the datasets and buckets that need further review. Check the latest documentation for the exact filter syntax, and for whether filtering on the absence of tags is supported directly.

Applying Tags:

Manual and Bulk Tagging: Dataplex supports data governance and classification through both manual and bulk tagging capabilities. Understanding the process for tagging within the Dataplex interface and leveraging API capabilities for task automation are key to ensuring efficient compliance and data management.
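As a rough sketch of bulk tagging through the API, the snippet below attaches one tag to many catalog entries with the Data Catalog Python client (`google-cloud-datacatalog`). The helper names, the template name, and the boolean field IDs passed in `flags` are hypothetical; a matching tag template would have to exist already. The loop is kept separate from the client call so it can be tried without credentials.

```python
def bulk_apply(entry_names, apply_tag):
    """Call apply_tag(name) for each entry; return (name, error) pairs for failures."""
    failed = []
    for name in entry_names:
        try:
            apply_tag(name)
        except Exception as exc:  # keep going so one bad entry does not stop the batch
            failed.append((name, str(exc)))
    return failed


def tag_entries_as_sensitive(entry_names, template_name, flags):
    """Attach a tag with boolean fields (e.g. {"has_pii": True}) to each entry.

    template_name and the field IDs in flags are assumptions; they must match
    a tag template you have created beforehand.
    """
    from google.cloud import datacatalog_v1  # imported lazily; needs the client library

    client = datacatalog_v1.DataCatalogClient()
    tag = datacatalog_v1.types.Tag()
    tag.template = template_name
    for field_id, value in flags.items():
        tag.fields[field_id] = datacatalog_v1.types.TagField()
        tag.fields[field_id].bool_value = value
    return bulk_apply(entry_names, lambda name: client.create_tag(parent=name, tag=tag))
```

Returning the failures instead of raising makes it easier to retry only the entries that were not tagged.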

Automation Considerations:

Iterative Approach: An effective strategy is to start with Dataplex's built-in tools to establish a baseline of tags, then progressively build automation. Using the Dataplex API together with Cloud Functions or Cloud Run, triggered by events published to Pub/Sub, lets you develop custom tagging logic tailored to your governance requirements.
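One way this could look, assuming audit log entries for resource creation are routed to a Pub/Sub topic: the function below decodes a Pub/Sub message and reports whether it describes a new BigQuery dataset or Cloud Storage bucket. The function name and the method names in `CREATION_METHODS` are assumptions; verify them against the audit logs in your own project before relying on them.

```python
import base64
import json

# Assumed audit-log method names for resource creation; confirm against your own logs.
CREATION_METHODS = {
    "google.cloud.bigquery.v2.DatasetService.InsertDataset": "bigquery_dataset",
    "storage.buckets.create": "gcs_bucket",
}


def parse_creation_event(pubsub_message):
    """Return (kind, resource_name) if the message is a creation audit log, else None.

    pubsub_message is the Pub/Sub message dict whose "data" field holds the
    base64-encoded JSON log entry.
    """
    payload = base64.b64decode(pubsub_message["data"]).decode("utf-8")
    log_entry = json.loads(payload)
    proto = log_entry.get("protoPayload", {})
    kind = CREATION_METHODS.get(proto.get("methodName"))
    if kind is None:
        return None
    return kind, proto.get("resourceName")
```

A Cloud Function or Cloud Run handler could call this first and then queue the returned resource for tagging or review.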

Permissions, Refinement, & Staying Updated:

IAM Permissions: It's crucial to ensure that appropriate IAM permissions are set for users to discover, view, and tag assets within Dataplex.
Continuous Refinement: Over time, enhancing your data classification strategy, particularly by incorporating the DLP API directly, fortifies your approach to managing sensitive data.
Best Practices: Maintaining an up-to-date data governance practice necessitates regular consultation of the latest Dataplex and related Google Cloud services documentation to fully leverage the platform's capabilities.

Additional Considerations:

Data Catalog Integration: For advanced search and in-depth metadata exploration, use the Data Catalog UI alongside Dataplex.
Compliance and Security: Consistent tagging in Dataplex can support compliance requirements (such as GDPR and CCPA) and strengthen data security.
Collaboration and Access Control: Dataplex fosters secure collaboration among data stakeholders while upholding strict access control and governance policies.
Update and Maintenance Strategy: To adapt to evolving regulations, it's imperative to regularly update and maintain your data governance framework, including periodic reviews of tagging accuracy and classifier effectiveness.

Beginner

Thanks.

How do I specifically identify BigQuery datasets and Cloud Storage buckets that are "not tagged" using the Dataplex search function?

Also, what is the difference between tagging via Dataplex and tagging via BigQuery or Cloud Storage?

ms4446

Identifying untagged BigQuery datasets and Cloud Storage buckets using the Dataplex search function involves leveraging Dataplex's integration with Data Catalog for metadata management and discovery. Currently, Dataplex may not offer a direct "find untagged assets" filter. However, here are effective strategies to achieve this:

Strategies:

Leverage Data Catalog API: For robust control, use the Data Catalog API to search and filter assets within Dataplex. Construct queries focusing on the absence of tags within an asset's metadata for both BigQuery datasets and Cloud Storage buckets.
Custom Scripting: Develop scripts that interact with the Data Catalog API. Fetch a list of all relevant assets managed by Dataplex, and programmatically filter those lacking tags based on metadata checks.
Strategic Data Catalog Search: Utilize Data Catalog's search interface to find a wide range of assets. Then, employ client-side filtering or simple scripts to refine the results to display only those without tags.
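The scripting approach could be sketched roughly as follows with the Data Catalog Python client (`google-cloud-datacatalog`): search for entries, then keep those for which listing tags returns nothing. The function names and the default query string are assumptions; adjust the query to cover the assets you care about, and note that tags attached to tables inside a dataset do not count as tags on the dataset entry itself.

```python
from typing import Callable, Iterable, List


def find_untagged(entry_names: Iterable[str],
                  list_tags: Callable[[str], Iterable]) -> List[str]:
    """Return the entry names for which list_tags(name) yields no tags."""
    untagged = []
    for name in entry_names:
        # any() stops at the first tag, so fully tagged entries are not paged through.
        if not any(True for _ in list_tags(name)):
            untagged.append(name)
    return untagged


def find_untagged_in_project(project_id: str,
                             query: str = "type=dataset system=bigquery") -> List[str]:
    """Search one project and return catalog entries that have no tags.

    The default query is an assumption; change it to match your assets.
    """
    from google.cloud import datacatalog_v1  # imported lazily; needs the client library

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.types.SearchCatalogRequest.Scope()
    scope.include_project_ids.append(project_id)
    results = client.search_catalog(scope=scope, query=query)
    names = [result.relative_resource_name for result in results]
    return find_untagged(names, lambda name: client.list_tags(parent=name))
```

This makes one `list_tags` call per entry, which is fine for a periodic report but may need batching or caching for very large catalogs.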

Tagging in Dataplex vs. BigQuery/GCS:

Dataplex: Offers centralized tagging with a focus on data governance across multiple data assets and services. Use it to categorize data within Dataplex-managed data lakes and apply higher-level compliance or discovery-oriented tags.
BigQuery / GCS: Allow tagging/labeling directly within the services. These labels often relate to cost tracking, project organization, or operational concerns specific to that service.
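To illustrate the difference, service-level labels live on the resources themselves and are read through the BigQuery and Cloud Storage clients, not through the catalog. The sketch below collects those labels and reports resources missing a given key. The function names, the `bq:` and `gs:` prefixes, and the idea of a required label key are assumptions for illustration.

```python
def missing_label(resources, required_key):
    """Return sorted names of resources whose labels lack required_key.

    resources maps a resource name to its labels dict (or None).
    """
    return sorted(
        name for name, labels in resources.items()
        if required_key not in (labels or {})
    )


def collect_labels(project_id):
    """Read labels from BigQuery datasets and Cloud Storage buckets in a project."""
    from google.cloud import bigquery, storage  # imported lazily; need the client libraries

    labels = {}
    bq_client = bigquery.Client(project=project_id)
    for item in bq_client.list_datasets():
        dataset = bq_client.get_dataset(item.reference)
        labels["bq:" + dataset.dataset_id] = dict(dataset.labels)
    gcs_client = storage.Client(project=project_id)
    for bucket in gcs_client.list_buckets():
        labels["gs:" + bucket.name] = dict(bucket.labels)
    return labels
```

For example, `missing_label(collect_labels("my-project"), "cost-center")` would list datasets and buckets without a `cost-center` label, where both the project ID and the label key are placeholders.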

Beginner

Thanks, it will be a new challenge for me, but I will look into using the Data Catalog API to identify datasets and buckets without tags.

Also, thanks for clarifying the difference between tagging via Dataplex and tagging via BigQuery/GCS.