Data Loss Prevention

Vishesh1998 · 07-18-2024 11:46 AM

Hi, I am using GCP DLP. When I run discovery scans on my tables, they miss a few columns that do contain PII, and those columns do not appear in the scan results. For example, a table has two credit card columns and two address columns, but DLP reports only one of each. Why does this happen? Also, how can I reduce false positives and false negatives in DLP discovery scans?

ms4446

One primary reason DLP might miss PII is the limits of InfoType detection. DLP relies on predefined detectors and patterns to identify sensitive data, so PII that doesn't match them, or that is stored in an unusual format, might not be detected. PII that is specific to a particular organization or industry may also require custom InfoTypes for accurate detection.

Another common issue is scanning configuration. It's crucial to ensure that the scan job is correctly configured to include the appropriate tables and InfoTypes. Data volume and sampling also play a role; for large datasets, DLP might use sampling to speed up scans, potentially overlooking some PII. Furthermore, if the PII data is masked or obfuscated, it might be harder for DLP to identify it accurately.

To improve PII detection and reduce false positives and negatives, start by refining InfoType detection. Review the full list of built-in InfoTypes, and create custom InfoTypes with regular expressions or dictionaries to match your specific PII patterns. Proximity rules (DLP calls these hotword rules) can also raise or lower the likelihood of a finding when a context term appears nearby, such as "expiration date" near a credit card number.
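As a rough sketch of the custom InfoType and proximity (hotword) rule described above, the configuration could look like the plain dict below. No API call is made here. The field names follow the DLP v2 InspectConfig message as I understand it, so please verify them against the current API reference. MASKED_CARD_NUMBER, both regex patterns, and the 50-character window are hypothetical values chosen for illustration.

```python
# Inspection configuration expressed as a plain dict (nothing is sent to DLP).
# Assumption: field names mirror the DLP v2 InspectConfig message.
inspect_config = {
    # Built-in detectors to run.
    "info_types": [
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "STREET_ADDRESS"},
    ],
    # Hypothetical custom detector for masked card values
    # such as XXXX-XXXX-XXXX-1234 or ****-****-****-1234.
    "custom_info_types": [
        {
            "info_type": {"name": "MASKED_CARD_NUMBER"},
            "regex": {"pattern": r"(?:[Xx*]{4}[- ]?){3}\d{4}"},
            "likelihood": "POSSIBLE",
        }
    ],
    # Hotword rule: raise the likelihood of a card finding when an
    # expiration-related term appears within 50 characters of it.
    "rule_set": [
        {
            "info_types": [{"name": "CREDIT_CARD_NUMBER"}],
            "rules": [
                {
                    "hotword_rule": {
                        "hotword_regex": {
                            "pattern": r"(?i)(exp(iry|iration)?|valid thru)"
                        },
                        "proximity": {"window_before": 50, "window_after": 50},
                        "likelihood_adjustment": {
                            "fixed_likelihood": "VERY_LIKELY"
                        },
                    }
                }
            ],
        }
    ],
    # Findings below this likelihood are dropped; raising it trades
    # fewer false positives for more false negatives.
    "min_likelihood": "POSSIBLE",
    "include_quote": False,
}
```

A dict of this shape is typically what you would save in an inspection template and reference from the scan configuration. Start with a low min_likelihood while tuning, then raise it once the findings look right.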

Improving scanning configuration is another key strategy. Ensuring that all relevant tables and columns where PII might reside are included in the scan configuration is crucial. Adjusting the sampling rate can help if sampling is suspected to cause missed PII, albeit at the cost of increased scan time.
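If you follow up a discovery result with a targeted inspection job on one table, the column list and sampling rate can be set explicitly. The helper below is only a sketch that builds the storage configuration as a dict. The field names follow the DLP v2 BigQueryOptions message as I understand it, and the project, dataset, and table names are hypothetical.

```python
def build_storage_config(project_id, dataset_id, table_id, columns,
                         rows_limit_percent=100):
    """Build a BigQuery storage config dict that names the columns to inspect."""
    if not 0 <= rows_limit_percent <= 100:
        raise ValueError("rows_limit_percent must be between 0 and 100")
    return {
        "big_query_options": {
            "table_reference": {
                "project_id": project_id,
                "dataset_id": dataset_id,
                "table_id": table_id,
            },
            # Name every column that might hold PII so none is left out.
            "included_fields": [{"name": column} for column in columns],
            # Share of rows to scan; a higher value costs more scan time.
            "rows_limit_percent": rows_limit_percent,
            "sample_method": "RANDOM_START",
        }
    }


# Hypothetical project, dataset, and table names.
storage_config = build_storage_config(
    "my-project", "sales", "orders",
    ["cc_number", "cc_number_masked", "billing_address", "shipping_address"],
    rows_limit_percent=50,
)
```

If the missed columns show up once the row percentage is raised, sampling was the likely cause.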

Data preparation can also play a significant role in accurate PII detection. Where policy allows, scanning the data before it is masked or de-identified (or temporarily unmasking it for the scan) helps ensure that the PII is correctly identified. Normalizing data stored in inconsistent formats before scanning can also improve detection accuracy.
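As one illustration of normalization, a small helper can bring card-number strings into a single format before they are scanned. This is a minimal sketch; the function name and the set of separators it strips are my own assumptions, not anything DLP requires.

```python
import re
import unicodedata


def normalize_card_value(raw):
    """Return the value with compatibility characters folded and separators removed."""
    # NFKC folds full-width digits and similar compatibility characters to ASCII.
    text = unicodedata.normalize("NFKC", raw)
    # Remove whitespace and the separators commonly typed between digit groups.
    return re.sub(r"[\s.\-_/]", "", text)


print(normalize_card_value("4111 1111-1111.1111"))  # 4111111111111111
```

The same idea applies to addresses (consistent casing, expanded abbreviations), though the rules there depend heavily on your data.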

Iteration and review are important steps in the process. Analyzing false positives helps in refining InfoTypes or rules, while manually reviewing scan findings during the early stages of tuning the DLP configuration can help identify any missed PII.
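To make the review step measurable, you can compare the columns DLP flagged against a manually labeled list and track the numbers across tuning rounds. The sketch below assumes you have both lists at hand; the function name and the order_notes column are hypothetical.

```python
def detection_metrics(labeled_pii_columns, flagged_columns):
    """Compare manually labeled PII columns with the columns a scan flagged."""
    truth = set(labeled_pii_columns)
    flagged = set(flagged_columns)
    true_positives = truth & flagged
    return {
        # Flagged by the scan but not PII according to the manual review.
        "false_positives": sorted(flagged - truth),
        # PII according to the manual review but missed by the scan.
        "false_negatives": sorted(truth - flagged),
        "precision": len(true_positives) / len(flagged) if flagged else 0.0,
        "recall": len(true_positives) / len(truth) if truth else 0.0,
    }


metrics = detection_metrics(
    ["cc_number", "cc_number_masked", "billing_address", "shipping_address"],
    ["cc_number", "billing_address", "order_notes"],
)
print(metrics["false_negatives"])  # ['cc_number_masked', 'shipping_address']
print(metrics["false_positives"])  # ['order_notes']
```

Here recall is 0.5 (two of four PII columns found), which points at missing detectors, while the single false positive points at a rule that needs tightening.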

For instance, consider a dataset with columns like cc_number, cc_number_masked, billing_address, and shipping_address. To ensure all credit card numbers are detected, use both the built-in "Credit Card Number" InfoType (CREDIT_CARD_NUMBER) and a custom InfoType for masked credit card numbers, and include both the cc_number and cc_number_masked columns in the scan. Similarly, using the "Street Address" InfoType (STREET_ADDRESS) and other relevant address InfoTypes, or creating custom InfoTypes for specific address formats, helps ensure that both billing_address and shipping_address are detected.
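Before relying on a custom pattern for cc_number_masked, it can help to check it locally against sample values from both columns. The sketch below is a local pre-check only and does not reproduce how DLP detects cards internally. The masked-number regex is a hypothetical pattern you would adapt to your own masking format, and the Luhn checksum is a standard validity test for full card numbers.

```python
import re

# Hypothetical mask format: three masked groups, then the last four digits.
MASKED_CARD = re.compile(r"^(?:[Xx*]{4}[- ]?){3}\d{4}$")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def classify_card_value(value):
    """Label a sample value as 'masked', 'full', or 'none'."""
    if MASKED_CARD.match(value.strip()):
        return "masked"
    if luhn_valid(value):
        return "full"
    return "none"


print(classify_card_value("XXXX-XXXX-XXXX-1234"))  # masked
print(classify_card_value("4111 1111 1111 1111"))  # full
print(classify_card_value("4111 1111 1111 1112"))  # none
```

Values that come back as "none" but are known to be card data show which formats your InfoTypes still need to cover.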