
Automate Security Classification by Adding Policy Tags to BigQuery Tables

Hi,

How can we build a Cloud Function that reads the results of Data Loss Prevention (DLP) data profile scans and tags the PII columns it finds with policy tags from a data taxonomy?

Do we need to use the DLP API for that?

1 ACCEPTED SOLUTION

Hi @Vishesh1998 ,

Automating the addition of policy tags to BigQuery tables based on DLP data profile scans is a great approach to ensuring security and compliance. You can indeed build a Cloud Function to handle this process. Here's how you can do it:

Overall Workflow

  • Trigger: The Cloud Function is triggered when a DLP data profile scan completes. Configure the scan to publish notifications to a Pub/Sub topic, and use that topic as the function's trigger.
  • Fetch Scan Results: The function uses the DLP API to retrieve the results of the specific data profile scan.
  • Analyze Findings: Parse the scan results, focusing on findings related to Personally Identifiable Information (PII), and identify the columns that contain PII.
  • Map to Taxonomy: Map each detected infoType (e.g., EMAIL_ADDRESS, PHONE_NUMBER, STREET_ADDRESS) to the appropriate policy tag within your data taxonomy (a sketch for creating such a taxonomy follows this list).
  • Apply Policy Tags: Use the BigQuery API to apply the relevant policy tags to the columns identified in the scan results.
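
If you don't already have a taxonomy with policy tags, here is a minimal, untested sketch of creating one with the Data Catalog API (google-cloud-datacatalog); the project ID, location, and display names are placeholders to replace with your own:

import google.cloud.datacatalog_v1 as datacatalog_v1

def create_pii_taxonomy(project_id, location="us"):
    """Create a taxonomy with one policy tag per sensitivity level.

    Returns a dict mapping display name -> policy tag resource name.
    """
    client = datacatalog_v1.PolicyTagManagerClient()

    # Taxonomies must be activated for fine-grained access control before
    # their tags can restrict access to BigQuery columns.
    taxonomy = client.create_taxonomy(
        parent=f"projects/{project_id}/locations/{location}",
        taxonomy=datacatalog_v1.Taxonomy(
            display_name="pii-classification",  # placeholder name
            activated_policy_types=[
                datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
            ],
        ),
    )

    tags = {}
    for level in ("pii-high", "pii-moderate", "pii-low"):  # placeholder levels
        tag = client.create_policy_tag(
            parent=taxonomy.name,
            policy_tag=datacatalog_v1.PolicyTag(display_name=level),
        )
        tags[level] = tag.name  # e.g. "projects/.../taxonomies/.../policyTags/..."
    return tags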

Here is a sample snippet to get you started. Treat it as an untested sketch rather than production code: it assumes google-cloud-dlp v3.16+ (which exposes the column data profile methods), that your discovery scan is configured to publish profile notifications to the Pub/Sub topic that triggers the function, and that the project, location, table, and policy tag IDs are replaced with your own:

import base64

from google.cloud import bigquery
from google.cloud import dlp_v2

# Placeholder identifiers -- replace with your own.
PROJECT_ID = "your-project"
LOCATION = "us"
TABLE_ID = "your-project.your_dataset.your_table"


def tag_pii_columns(event, context):
    # 1. Decode the notification. Data profile scans publish a serialized
    #    DataProfilePubSubMessage as the Pub/Sub message body.
    message = dlp_v2.DataProfilePubSubMessage.deserialize(
        base64.b64decode(event["data"])
    )
    table_profile_name = message.profile.name

    # 2. Initialize DLP and BigQuery clients.
    dlp_client = dlp_v2.DlpServiceClient()
    bq_client = bigquery.Client()

    # 3. Fetch the per-column profiles that belong to this table profile.
    parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"
    column_profiles = [
        p for p in dlp_client.list_column_data_profiles(parent=parent)
        if p.table_data_profile == table_profile_name
    ]

    # 4. Map each detected infoType to a policy tag.
    tags_by_column = {}
    for profile in column_profiles:
        info_type_name = profile.column_info_type.info_type.name  # e.g. "US_SOCIAL_SECURITY_NUMBER"
        policy_tag = map_info_type_to_tag(info_type_name)
        if policy_tag:
            tags_by_column[profile.column] = policy_tag

    # 5. Rebuild the schema with the tags attached. SchemaField objects are
    #    immutable, so tagged columns get a new SchemaField rather than an
    #    in-place assignment.
    table = bq_client.get_table(TABLE_ID)
    new_schema = []
    for field in table.schema:
        if field.name in tags_by_column:
            field = bigquery.SchemaField(
                name=field.name,
                field_type=field.field_type,
                mode=field.mode,
                description=field.description,
                policy_tags=bigquery.PolicyTagList(names=[tags_by_column[field.name]]),
            )
        new_schema.append(field)

    table.schema = new_schema
    bq_client.update_table(table, ["schema"])  # persist the column-level tags


# Helper mapping infoType names to policy tag resource names
# (replace with your own mapping; the tag IDs below are hypothetical).
def map_info_type_to_tag(info_type_name):
    mapping = {
        "EMAIL_ADDRESS": "projects/your-project/locations/us/taxonomies/123/policyTags/456",
        "US_SOCIAL_SECURITY_NUMBER": "projects/your-project/locations/us/taxonomies/123/policyTags/789",
    }
    return mapping.get(info_type_name)
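
A note on the approach: for classic DLP inspection jobs you would instead read event['attributes']['DlpJobName'] from the message attributes and call get_dlp_job, but a job's info_type_stats only carries aggregate counts per infoType, with no column names. Column data profiles record the predicted infoType per column, which is exactly what you need to decide which policy tag to attach where.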

Key Considerations

  • Data Taxonomy: Establish a well-defined data taxonomy with policy tags aligned with your organization's data classification policies.
  • Permissions: Ensure the Cloud Function's service account can read DLP data profiles, update BigQuery table schemas, and set policy tags on columns (e.g., roles/dlp.dataProfilesReader plus a BigQuery role that includes bigquery.tables.setCategory).
  • Error Handling: Implement robust error handling for scans that fail, results that are missing, or policy tags that cannot be applied; see the sketch after this list.
  • Fine-Tuning: Consider a threshold for how confident a finding must be before you tag automatically. For a low-confidence finding, you could queue the column for manual review or apply a less restrictive policy tag.
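
For the error-handling point, a minimal sketch; apply_tags is a hypothetical helper standing in for the tagging logic above, and re-raising only makes Pub/Sub redeliver the event if the function is deployed with retries enabled:

import logging

from google.api_core import exceptions as gax_exceptions

def apply_tags_safely(apply_tags, table_profile_name):
    """Wrap the tagging step: drop permanent failures, retry transient ones."""
    try:
        apply_tags(table_profile_name)
    except gax_exceptions.NotFound:
        # Profile or table is gone -- a retry will not help, so log and drop.
        logging.warning("No results found for %s; skipping.", table_profile_name)
    except gax_exceptions.GoogleAPICallError:
        # Other API failure -- log and re-raise so the event is retried.
        logging.exception("Failed to tag columns for %s", table_profile_name)
        raise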

Need for DLP API

Yes, you will need to use the DLP API to retrieve the results of the data profile scans. The Cloud Function will interact with this API to fetch the detailed findings that guide the application of policy tags.
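
As a quick sanity check that your function's credentials can reach the API, you can fetch a single table profile directly (again an untested sketch assuming google-cloud-dlp v3.16+; the profile ID below is a placeholder):

from google.cloud import dlp_v2

dlp_client = dlp_v2.DlpServiceClient()
# Profile resource names look like:
#   projects/PROJECT/locations/LOCATION/tableDataProfiles/ID
profile = dlp_client.get_table_data_profile(
    name="projects/your-project/locations/us/tableDataProfiles/1234567890"
)
print(profile.table_id, profile.sensitivity_score.score)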
