Re: Datastream - PostgreSQL Source and BigQuery De...

nasdenkov · 09-05-2023 01:26 AM

We have a self-managed PostgreSQL v10 database in which multiple tables have their replication identity set to full. Our Datastream stream picks that up and determines that every individual column for such a table must be its primary key when writing it into BigQuery. This causes several problems.

First, Datastream can only handle tables with maximum 16 primary keys. Tables with more than 16 columns are therefore not streamed properly and fail with BIGQUERY_TOO_MANY_PRIMARY_KEYS.

Second, Datastream can only handle primary key columns of specific data types. Because of the assumption that all the columns of the table are primary keys, tables that use interesting data types such as integer arrays or JSONB cannot be properly streamed and fail with BIGQUERY_UNSUPPORTED_TYPE_FOR_PRIMARY_KEY.

In the case where no primary key exists, the current behaviour makes sense. However, Postgres 10 introduces a field rd_pkindex which is present on all change events. I would expect Datastream/BigQuery to use that to identify the primary key of a table, even if the tables' replication identity is set to full.

Is there a way to make Datastream use the primary key of the source table as the primary key in BigQuery for tables that have REPLICA IDENTITY FULL? This would resolve the aforementioned problems.

lsolatorio

Hi @nasdenkov,

Welcome and thank you for reaching out to our community.

I get that you are having challenges with your approach in replicating tables from PostgreSQL to BigQuery.

Ideally, when Datastream creates a BigQuery table, it assumes the primary keys from its source. This is how the UPDATE and DELETE operations are applied in BigQuery. REPLICA IDENTIFY is commonly used for tables without primary keys (which should be avoided), setting it to FULL will somehow simulate the same UPDATE and DELETE operations however it has its limitations and may pose performance issues for it can bulk up the WAL volume.

You might need to revisit your strategy to this replication process. Here are some usable resources that discuss the steps and limitations involved in replication your tables from PostgreSQL to BigQuery.

Hope this helps.

yuriysavchuk

Are there any updates? I faced the same issue too.

dstendardi

Having the same issue. Support for full replica identity would be beneficial, as it would simplify Datastream adoption in scenarios where it needs to coexist with a system that relies on REPLICA IDENTITY FULL."

Datastream - PostgreSQL Source and BigQuery Destination - Handling of REPLICA IDENTITY FULL