Google Cloud's Datastream: Not replicating deletes to BigQuery

I have a stream set up with Datastream where a PostgreSQL database on Cloud SQL replicates to BigQuery. When data is inserted into the PostgreSQL database, replication to BigQuery has been working just fine.

I went to delete a row from the PostgreSQL database (after running ALTER TABLE my_table REPLICA IDENTITY FULL;), but the delete is not being replicated and Datastream returns an error:

event_code: "UNSUPPORTED_EVENTS_DISCARDED"
message: "Discarded 1 unsupported events for BigQuery destination: project_id.dataset.my_table, with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported.."
object_name: "my_table"

So now, I have 179 rows in the PostgreSQL database and 180 in BigQuery because it's not replicating the delete. Not sure what is going on since I did not change anything with Primary Keys and I'm not adding any Primary Keys...
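
A quick way to compare the two sides, assuming psql access to the Cloud SQL instance and the BigQuery console or bq CLI (names taken from my setup and the error above):

-- Source row count (PostgreSQL on Cloud SQL)
SELECT COUNT(*) FROM my_table;

-- Destination row count (BigQuery)
SELECT COUNT(*) FROM `project_id.dataset.my_table`;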

1 ACCEPTED SOLUTION

The error message indicates that Datastream encountered an issue with replicating the DELETE operation to BigQuery due to an unsupported primary key change. While BigQuery does not enforce primary key constraints like traditional databases, Datastream uses a primary key or unique identifier to track changes between the source and destination.

The ALTER TABLE statement you executed in PostgreSQL set the REPLICA IDENTITY to FULL, ensuring that the entire row is captured in change events. However, this does not address the primary key discrepancy in BigQuery.

To resolve this issue:

  1. Ensure that the PostgreSQL table has a primary key defined, and that this primary key or unique identifier is also present in the BigQuery table (a query for checking the source side is sketched below).
  2. Review your Datastream connection settings to ensure that it's configured to handle deletes correctly.
  3. If there are discrepancies in data, consider manually syncing the rows between PostgreSQL and BigQuery to ensure consistency.

It's essential to ensure that the source and destination schemas are in sync and that any changes to the source schema are appropriately reflected in the destination.
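
As a rough sketch of step 1, assuming psql access and the table name my_table from the question (adjust names to your own setup), you can check which primary key PostgreSQL currently reports, since that is the identifier Datastream relies on for UPDATE and DELETE events:

-- List any primary key constraint defined on the source table
SELECT conname, pg_get_constraintdef(oid)
FROM pg_constraint
WHERE conrelid = 'my_table'::regclass AND contype = 'p';

-- List the column(s) that make up that primary key
SELECT a.attname
FROM pg_index i
JOIN pg_attribute a ON a.attrelid = i.indrelid AND a.attnum = ANY(i.indkey)
WHERE i.indrelid = 'my_table'::regclass AND i.indisprimary;

If these return no rows, the source table currently has no primary key for Datastream to use.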

4 REPLIES

Thank you for the response. I created a Primary Key in the PostgreSQL table, but I am now getting a new error when the stream's backfill starts:

Datastream can't create a table in BigQuery, because the table's primary key has a column with an unsupported data type.

 I created the Primary Key in PostgreSQL with:

ALTER TABLE stocks ADD COLUMN id SERIAL PRIMARY KEY

Is this not the correct way to create a primary key for BigQuery?

The SERIAL data type in PostgreSQL is not directly supported as a primary key data type for replication to BigQuery using Datastream. Instead, you can use the BIGINT data type in PostgreSQL, which corresponds to INT64 in BigQuery.

To create a primary key with the BIGINT data type in PostgreSQL, you can use the following ALTER TABLE statement:

ALTER TABLE stocks ADD COLUMN id BIGINT PRIMARY KEY;
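
One thing to watch, assuming the stocks table already contains rows: a bare BIGINT PRIMARY KEY column may be rejected because existing rows would hold NULL in the new column. A possible variant, assuming PostgreSQL 10 or later, is an identity column, which PostgreSQL backfills for existing rows and which is still a BIGINT (INT64 in BigQuery) underneath:

-- Drop any leftover id column from the earlier SERIAL attempt first, if it is still there
ALTER TABLE stocks ADD COLUMN id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY;

-- Confirm the type PostgreSQL reports for the new column (should be bigint)
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'stocks' AND column_name = 'id';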

Thank you for your help. It looks like everything is working well now after testing another delete. Going forward, I think I will set the replica identity back to DEFAULT:

ALTER TABLE stocks REPLICA IDENTITY DEFAULT;
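
With a primary key now in place, DEFAULT logs only the key columns for deletes, which should be enough for Datastream to match the row in BigQuery. The setting can be read back to confirm it took effect ('d' = default, 'f' = full):

SELECT relreplident FROM pg_class WHERE oid = 'stocks'::regclass;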