
Datastream occasionally ingests columns that are not included

I have a Datastream stream that ingests data from MySQL into Google Cloud Storage. The deployment is managed by Terraform. Datastream is configured to include only certain columns of certain tables from MySQL, so that columns containing sensitive data are not ingested. The total count of columns that should be included across all tables in Terraform matches the count in the Datastream UI under Overview -> Properties -> Objects to include (see the screenshot below), so I assume that rules out the possibility that something is wrong with the object inclusion in Terraform.

[Screenshot: Datastream UI, Overview -> Properties -> Objects to include]

However, I've noticed that in some partitions (though not all of them) in the Cloud Storage bucket, Datastream still ingests these "sensitive" columns that are not on the include list and shouldn't be ingested at all. The behavior is inconsistent: the sensitive data shows up in some partitions, seemingly at random, but not in others.
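For reference, the affected files can be located with something like the sketch below (the bucket name, object prefix, and column names are placeholders; Datastream's JSON output nests the row data under "payload", as the error further down shows):

    # Scan Datastream's JSONL output for columns that should have been excluded.
    # Bucket name, prefix, and column names are placeholders.
    import json
    from google.cloud import storage

    BUCKET = "my-datastream-bucket"                          # placeholder
    PREFIX = "mysql-cdc-binlog"                              # placeholder
    EXCLUDED = {"sensitive_column_a", "sensitive_column_b"}  # columns that must never appear

    client = storage.Client()
    for blob in client.list_blobs(BUCKET, prefix=PREFIX):
        if not blob.name.endswith(".jsonl"):
            continue
        for line in blob.download_as_text().splitlines():
            if not line.strip():
                continue
            payload = json.loads(line).get("payload", {})
            leaked = EXCLUDED & payload.keys()
            if leaked:
                print(f"{blob.name}: unexpected columns {sorted(leaked)}")
                break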

This also causes errors in my BigQuery external tables because BigQuery encounters fields that are not defined in the external table schema:


Error while reading table: REDACTED, error message: JSON parsing error in row starting at position 0: No such field: payload.REDACTED_SENSITIVE_COLUMN_NAME. File: gs://REDACTED/REDACTED_mysql-cdc-binlog_-1948622057_5_10459722.jsonl


Is there a way to fix this in Datastream so that it only ingests the columns that are included, or am I missing something?


Hi @farrukh 

Welcome to the Google Cloud Community!

We’re aware of similar reports about inconsistent column inclusion and exclusion in Datastream, and we’re working to address this. You can follow updates by subscribing to the issue tracker here and adding a +1 to raise its visibility.

In the meantime, I recommend the following steps:

  1. Manually exclude the columns you don’t need using a custom schema (see the first sketch after this list for one way to do this on the BigQuery side).
  2. Refresh the schema for your stream in the Datastream console to ensure it matches the MySQL source.
  3. If the issue persists, create a new stream and check whether it reproduces the behavior.
  4. Use partitioned BigQuery tables and validate the schema against the Cloud Storage structure after changes.
  5. Since you’re working with sensitive data, consider using Sensitive Data Protection to de-identify your data source or to de-identify sensitive Cloud Storage data (a minimal de-identification example follows this list).
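On point 1, one way to keep the external table queryable while the stream issue is investigated is to define it with an explicit schema and ignore_unknown_values, so that stray fields in the JSONL files are skipped instead of causing the "No such field" error. A minimal sketch with the google-cloud-bigquery client; the project, dataset, URI, and field names are placeholders:

    # Recreate the external table with an explicit schema and
    # ignore_unknown_values so extra fields in the JSONL files are skipped.
    # Project, dataset, table, URI, and field names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
    external_config.source_uris = ["gs://my-datastream-bucket/path/*.jsonl"]  # placeholder
    external_config.ignore_unknown_values = True  # skip fields not listed below
    external_config.schema = [
        bigquery.SchemaField("payload", "RECORD", fields=[
            bigquery.SchemaField("id", "INTEGER"),   # placeholder columns
            bigquery.SchemaField("name", "STRING"),
        ]),
    ]

    table = bigquery.Table("my-project.my_dataset.my_external_table")  # placeholder
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)

Note that ignoring unknown values only hides the extra fields from queries; the sensitive values are still stored in Cloud Storage, so the Datastream-side fix is still needed.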
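And on point 5, a minimal Sensitive Data Protection (DLP API) de-identification sketch, assuming the google-cloud-dlp client; the project ID, info type, and sample value are placeholders:

    # De-identify a text value by replacing detected sensitive data with its
    # info type name. Project ID, info type, and sample value are placeholders.
    import google.cloud.dlp_v2

    dlp = google.cloud.dlp_v2.DlpServiceClient()
    parent = "projects/my-project/locations/global"  # placeholder project

    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},  # placeholder info type
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {"primitive_transformation": {"replace_with_info_type_config": {}}}
                    ]
                }
            },
            "item": {"value": "Contact me at jane.doe@example.com"},  # placeholder value
        }
    )
    print(response.item.value)  # e.g. "Contact me at [EMAIL_ADDRESS]"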

If none of these resolve the issue, you can file a detailed bug report through the Issue Tracker. Include steps to reproduce, configuration details, and logs to help us investigate. While there isn’t a specific timeframe for resolution, once a fix reaches production we’ll note it on the bug and then update and close it.

I hope this helps!