Solved: Datastream doesn't correctly replicate JSONB field...

hansottowirtz · 07-17-2023 06:06 AM

We're replicating multiple tables from Postgres to BigQuery using Datastream. We use quite a lot of JSONB fields. Some of the JSONB fields are replicated as null, whereas they are not null in the Postgres database, and they've never been null. Backfilling seems to resolve the issue for most rows, but not always. It's seemingly random.

I think this is a Datastream bug. What information would I need to provide to investigate this issue?

The docs mention JSONB fields can't contain more than 2950 nested objects, but our fields are very small, so that's not the case.

hansottowirtz

This issue was already reported: https://issuetracker.google.com/issues/267194799

View solution in original post

ms4446

The information you would need to provide to investigate this issue would include:

The specific queries being employed to transfer data from your Postgres database to BigQuery.
The structure of the tables in both your Postgres and BigQuery databases.
The actual values of the JSONB fields that are being replicated as null in BigQuery.
The outcomes of your backfill operations.

hansottowirtz

1. I have just used the UI in the Google Cloud console to set up Datastream, and made connection profiles and streams with the UI. Should I use the CLI to get more details?

2. This error occurs in our "Orders" table. It has an "address" field which is in JSONB. Out of tens of thousands of rows, 12 of the address fields are null in BigQuery, and they're all not-null in Postgres. There are other JSONB fields that are also replicated as null, but they're all in different rows. There seems to be no correlation.

3. The "address" field looks as follows:

{
  "address": {
    "zip": "2000",
    "city": "Antwerp",
    "line2": "",
    "line3": "",
    "state": "",
    "street": "Sint-Pietersvliet 7",
    "country": "BE",
    "areaLevel2": ""
  },
  "lastName": "Wirtz",
  "firstName": "Hans Otto",
  "phoneNumber": ""
}

4. The backfill operations all succeed. Do you need more specific information on this one?

5. We're not using private connections. I'm not aware where I can find the Datastream version.

ms4446

Yes, using the CLI could provide more granular details that might not be visible through the UI. It can be a more powerful tool for diagnosing issues.

The occurrence of this issue in only 12 out of tens of thousands of rows does hint that it might not be a systemic problem with Datastream, but rather something specific to those rows.

The structure of your "address" field seems standard and shouldn't pose any issues for replication.

The fact that backfill operations are successful is a positive sign, indicating that the data in the "Orders" table is likely not corrupt.

If you're not utilizing private connections, your Datastream version would typically align with your Google Cloud Platform version. You can check your Google Cloud Platform version by executing the command gcloud version. Once you have that, you can cross-reference it with the Datastream version in the Datastream release notes here: https://cloud.google.com/datastream/docs/release-notes.

Given the information you've provided, it seems plausible that there might be something unique about the data in those 12 rows where the "address" field is null. I recommend inspecting the data in those rows for any anomalies.

hansottowirtz

I'm not sure I'm following you, how can my CLI version influence the version of Datastream I'm using? Is it even possible to choose the version?

I'm quite convinced this is an issue somewhere in Datastream, as the backfill seems to solve it and newly added records have the issue until I do a new backfill. I've manually inspected the rows but I can't find any anomalies. I've backfilled again and some of the rows have disappeared.

hansottowirtz

This issue was already reported: https://issuetracker.google.com/issues/267194799

Datastream doesn't correctly replicate JSONB fields from Postgres to BigQuery