We have a self-managed PostgreSQL 10 database in which several tables have their replica identity set to FULL. Our Datastream stream picks that up and treats every column of such a table as a primary key column when writing to BigQuery. This causes two problems.
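For reference, the replica identity can be inspected and changed per table; `my_table` below is a placeholder name:

```sql
-- Inspect the replica identity of a table:
-- d = default (primary key), f = full, i = index, n = nothing
SELECT relname, relreplident
FROM pg_class
WHERE relname = 'my_table';

-- If FULL is not actually required, reverting to DEFAULT sidesteps the
-- problem by making the primary key the replica identity again
ALTER TABLE my_table REPLICA IDENTITY DEFAULT;
```

In our case, though, FULL is set deliberately, so reverting is not an option.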
First, Datastream supports at most 16 primary key columns per table. Tables with more than 16 columns therefore fail to stream with BIGQUERY_TOO_MANY_PRIMARY_KEYS.
Second, Datastream only supports specific data types for primary key columns. Because every column of the table is assumed to be part of the primary key, tables that use types such as integer arrays or JSONB cannot be streamed and fail with BIGQUERY_UNSUPPORTED_TYPE_FOR_PRIMARY_KEY.
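The two failure modes above can be checked before creating a stream. Below is a minimal pre-flight sketch in Python; the table metadata is supplied as plain dicts here (in practice it would come from `information_schema` queries), and the set of unsupported types is an illustrative assumption, not the authoritative list from the Datastream documentation:

```python
MAX_PRIMARY_KEY_COLUMNS = 16  # Datastream's per-table primary key limit

# Assumed, illustrative set of types rejected as BigQuery primary keys
# (arrays and JSONB among them); consult the Datastream docs for the real list.
UNSUPPORTED_PK_TYPES = {"jsonb", "json", "integer[]", "text[]", "bytea"}

def check_table(name, columns):
    """Return problems for a table with REPLICA IDENTITY FULL, where
    Datastream treats every column as a primary key column."""
    problems = []
    if len(columns) > MAX_PRIMARY_KEY_COLUMNS:
        problems.append(
            f"{name}: {len(columns)} columns exceed "
            f"{MAX_PRIMARY_KEY_COLUMNS} (BIGQUERY_TOO_MANY_PRIMARY_KEYS)"
        )
    for col, typ in columns.items():
        if typ in UNSUPPORTED_PK_TYPES:
            problems.append(
                f"{name}.{col}: type {typ} "
                "(BIGQUERY_UNSUPPORTED_TYPE_FOR_PRIMARY_KEY)"
            )
    return problems

# A table with a JSONB column trips the type check
print(check_table("events", {"id": "bigint", "payload": "jsonb"}))
```

This only mirrors the constraints described above; it does not change how Datastream derives the key.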
Where no primary key exists, the current behaviour makes sense. However, PostgreSQL 10 introduced the rd_pkindex field, which identifies a table's primary key index and is available during logical decoding. I would expect Datastream/BigQuery to use that to identify the primary key of a table, even if the table's replica identity is set to FULL.
Is there a way to make Datastream use the source table's primary key as the primary key in BigQuery for tables that have REPLICA IDENTITY FULL? That would resolve both problems described above.