I have a Datastream stream that ingests data from MySQL into Google Cloud Storage. The deployment is managed with Terraform. The stream is configured to include only certain columns of certain tables from MySQL, to avoid ingesting columns that contain sensitive data. The total number of included columns across all tables in the Terraform configuration matches the count shown in the Datastream UI under Overview -> Properties -> Objects to include (see the screenshot below), so I assume this rules out a mistake in the object inclusion part of the Terraform configuration.
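For reference, the inclusion is declared in Terraform roughly like the sketch below. This is a simplified, hypothetical version of the `google_datastream_stream` resource; the database, table, column, and connection profile names are placeholders, not my actual configuration:

```hcl
resource "google_datastream_stream" "mysql_to_gcs" {
  stream_id     = "mysql-cdc-binlog"   # placeholder
  display_name  = "mysql-cdc-binlog"   # placeholder
  location      = "us-central1"        # placeholder
  desired_state = "RUNNING"

  source_config {
    source_connection_profile = google_datastream_connection_profile.mysql.id # placeholder
    mysql_source_config {
      include_objects {
        mysql_databases {
          database = "app_db"          # placeholder database
          mysql_tables {
            table = "customers"        # placeholder table
            # Only the listed columns should be ingested;
            # the sensitive columns are simply not listed here.
            mysql_columns {
              column = "id"
            }
            mysql_columns {
              column = "created_at"
            }
          }
        }
      }
    }
  }

  destination_config {
    destination_connection_profile = google_datastream_connection_profile.gcs.id # placeholder
    gcs_destination_config {
      json_file_format {
        schema_file_format = "NO_SCHEMA_FILE"
        compression        = "NO_COMPRESSION"
      }
    }
  }

  backfill_all {}
}
```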
However, I've noticed that in some partitions of the Cloud Storage bucket (not all of them), Datastream still writes these "sensitive" columns that are not in the include list and shouldn't be ingested at all. The behaviour is inconsistent: the sensitive data shows up in some partitions, seemingly at random, but not in others.
This also causes errors in my BigQuery external tables because BigQuery encounters fields that are not defined in the external table schema:
```
Error while reading table: REDACTED, error message: JSON parsing error in row starting at position 0: No such field: payload.REDACTED_SENSITIVE_COLUMN_NAME. File: gs://REDACTED/REDACTED_mysql-cdc-binlog_-1948622057_5_10459722.jsonl
```
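For completeness, the external tables are defined roughly like this (again a hypothetical sketch with placeholder names, based on the `google_bigquery_table` resource). Because `ignore_unknown_values` is not enabled and the schema only declares the non-sensitive columns, any extra field in a JSONL file produces the error above:

```hcl
resource "google_bigquery_table" "cdc_external" {
  dataset_id = "cdc"                  # placeholder dataset
  table_id   = "customers_external"   # placeholder table

  external_data_configuration {
    autodetect            = false
    source_format         = "NEWLINE_DELIMITED_JSON"
    ignore_unknown_values = false     # unknown fields make the query fail
    source_uris           = ["gs://REDACTED/*.jsonl"] # placeholder bucket/path

    # Datastream wraps the row data in a "payload" record; only the
    # included (non-sensitive) columns are declared here. Other
    # Datastream metadata fields are omitted for brevity.
    schema = jsonencode([
      {
        name = "payload"
        type = "RECORD"
        fields = [
          { name = "id", type = "INTEGER" },
          { name = "created_at", type = "TIMESTAMP" },
        ]
      },
    ])
  }
}
```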
Is there a way to fix this in Datastream so that it ingests only the columns that are included, or am I missing something?