
Schema not updating in Datastream destination table

I have a Datastream stream set up that is streaming several tables from one of our Cloud SQL Postgres databases into a BigQuery dataset.

We were under the impression that Datastream should gracefully handle schema changes.  In this case we have a "message" column in one of the tables that changed in Postgres from varchar(250) to varchar(500).   

This seems like an easy update in BQ, since the new column is wider and it's the same type (STRING). But it did not update, and now values longer than 250 chars are being dropped because they won't fit into the column.

We cannot of course manually update the table in BQ as there's a stream attached.  Is it expected that a minor change like this in the source schema would not propagate to the destination with Datastream?  So far I'm not finding documentation on the recommended process to update this.  Do I delete this table from the datastream definition, rename the BQ table, and then add the table back into the datastream?  I assume at that point I'd have to backfill as there'd be a gap in the data during the time it was not in the stream.
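
For reference, a quick way to confirm that the destination column is still declared with the old 250-character cap, with placeholder project, dataset, and table names, is something like this:

```python
# Check the destination column's declared type in BigQuery.
# Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT column_name, data_type
    FROM `my-project.my_dataset.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name = 'my_table' AND column_name = 'message'
"""
for row in client.query(query).result():
    # A parameterized column shows up as e.g. STRING(250); a plain one as STRING.
    print(row.column_name, row.data_type)
```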


1 ACCEPTED SOLUTION

You're correct that Datastream is designed to handle minor schema changes, such as increasing the size of a varchar field. However, there are instances, as in your case, where these changes in the source schema are not automatically propagated to the corresponding BigQuery schema. This can result in issues like data truncation when the data exceeds the originally defined column size.

While comprehensive documentation on this specific scenario might be limited, there are several workarounds to consider:

  1. Delete and Re-create the Table: You can remove the affected table from the Datastream configuration, rename the existing BigQuery table (to preserve historical data), and then re-add the table to the Datastream. This process will create a new BigQuery table with the updated schema reflecting the current source schema. Be aware that there will be a gap in the data during the period when the table is removed from the stream, and you will need to backfill this data manually. A minimal sketch of the rename step is shown after this list.

  2. Create a New Stream: Another option is to set up a new Datastream for the table that has undergone the schema change. This will result in a new BigQuery table with the updated schema, thereby avoiding data truncation issues. Subsequently, you can merge this data with your existing dataset if necessary.

  3. Use a Dataflow Job: For greater control and flexibility, especially with more complex schema changes, consider implementing a Dataflow job. This job can read data from the source PostgreSQL table, apply the necessary schema transformations (like increasing the varchar size), and then write the data to the BigQuery table. This approach is more complex but can be more effective for specific transformation needs.
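
As a minimal sketch of the rename step in option 1, assuming placeholder project, dataset, and table names, and assuming the table has already been removed from the stream's included objects (re-adding it and triggering the backfill is done in the Datastream console or API and is not shown):

```python
# Sketch of the rename step only; all identifiers are placeholders.
# Run this after the table is removed from the stream's include list, so
# Datastream can recreate it with the widened schema once it is re-added.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

client.query(
    "ALTER TABLE `my-project.my_dataset.my_table` RENAME TO my_table_archive"
).result()

# After re-adding the table to the stream and letting the backfill finish,
# the preserved rows remain queryable in my_dataset.my_table_archive.
```

Once the backfill completes, you can union or merge the archived table with the re-created one if you need a single continuous history.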


3 REPLIES

To elaborate on the above: is there a reason the `varchar` size is carried over in the first place? This is quite limiting, and schema changes to constraints are bound to happen.

For example, a `uuid` is mapped to a `string` and not a `string(36)`; you could argue for doing the same with any `varchar` field.

You can model complex constraints in your Postgres database that are not reflected in BigQuery, and I would argue that the size of a `varchar` is such a constraint. It makes sense that you can't change from `string` to `float64`, as that means changing the type, but the size is a constraint on the value of a given type.
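
As an illustration, assuming a placeholder table id and a recent version of the google-cloud-bigquery client library (where `SchemaField.max_length` is populated only for parameterized STRING/BYTES columns), you can list which destination columns actually carry a length cap:

```python
# List which STRING columns in the destination table carry a length cap.
# The table id is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")

for field in table.schema:
    if field.field_type == "STRING":
        # max_length is None for a plain, unparameterized STRING column.
        print(field.name, field.max_length or "unbounded")
```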

Currently, Google Datastream replicates varchar size constraints directly from source databases such as PostgreSQL into BigQuery. While this method ensures data integrity, it can introduce challenges when source schemas evolve. To address these challenges and improve the flexibility and user-friendliness of schema management in Datastream, the following feature suggestions are proposed:

  • Schema Change Monitoring and Alerts: I believe the most impactful enhancement in the short term would be a robust mechanism for monitoring schema changes in the source database, including changes to varchar sizes. Proactive alerts about discrepancies between the source and destination schemas would significantly reduce the risk of unexpected data issues and give users time for informed adjustments. A rough sketch of such a check is shown after this list.

  • Smart Defaults: Implement a feature that analyzes sample data within varchar fields to infer an appropriate default size for BigQuery columns. This would simplify the initial setup and reduce manual adjustments, especially as source schemas change. Users could still override the defaults where needed.

  • Advanced Constraint Analysis: Develop a more sophisticated system for interpreting and translating constraints from the source database. For example, recognizing when a size constraint on a varchar field could be safely relaxed in BigQuery without compromising data integrity would provide welcome flexibility.
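
Until something like the monitoring suggestion above exists natively, a rough do-it-yourself drift check could look like the sketch below. The connection string and table identifiers are placeholders; it simply compares declared varchar lengths in Postgres with the parameterized STRING lengths on the BigQuery side.

```python
# Rough schema-drift check: compare declared varchar lengths in Postgres with
# the parameterized STRING lengths in the BigQuery destination table.
# The DSN and table identifiers are placeholders.
import psycopg2
from google.cloud import bigquery

PG_DSN = "host=127.0.0.1 dbname=mydb user=replicator password=changeme"
BQ_TABLE_ID = "my-project.my_dataset.my_table"

with psycopg2.connect(PG_DSN) as conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT column_name, character_maximum_length
        FROM information_schema.columns
        WHERE table_schema = 'public'
          AND table_name = 'my_table'
          AND character_maximum_length IS NOT NULL
        """
    )
    source_lengths = dict(cur.fetchall())

bq_client = bigquery.Client()
for field in bq_client.get_table(BQ_TABLE_ID).schema:
    source_len = source_lengths.get(field.name)
    if source_len and field.max_length and source_len > field.max_length:
        # The source column was widened but the destination still has the old cap.
        print(
            f"DRIFT: {field.name} is varchar({source_len}) in Postgres "
            f"but STRING({field.max_length}) in BigQuery"
        )
```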