I have a Datastream set up that is streaming several tables from one of our cloud SQL Postgres databases into a BigQuery dataset.
We were under the impression that Datastream should gracefully handle schema changes. In this case we have a "message" column in one of the tables that changed in Postgres from varchar(250) to varchar(500).
This seems like an easy update in BQ, since the new column is larger and it's the same type (STRING). But the BigQuery column was not updated, and now values longer than 250 chars are being dropped because they won't fit into the column.
We cannot of course manually update the table in BQ as there's a stream attached. Is it expected that a minor change like this in the source schema would not propagate to the destination with Datastream? So far I'm not finding documentation on the recommended process to update this. Do I delete this table from the datastream definition, rename the BQ table, and then add the table back into the datastream? I assume at that point I'd have to backfill as there'd be a gap in the data during the time it was not in the stream.
You're correct that Datastream is designed to handle minor schema changes, such as increasing the size of a varchar field. However, there are instances, as in your case, where these changes in the source schema are not automatically propagated to the corresponding BigQuery schema. This can result in issues like data truncation when the data exceeds the originally defined column size.
While comprehensive documentation on this specific scenario might be limited, there are several workarounds to consider:
Delete and Re-create the Table: You can remove the affected table from the Datastream configuration, rename the existing BigQuery table (to preserve historical data), and then re-add the table to the Datastream. This process will create a new BigQuery table with the updated schema reflecting the current source schema. Be aware that there will be a gap in the data during the period when the table is removed from the stream, and you will need to backfill this data manually.
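If you go this route, the "rename" step can be scripted with the BigQuery client library. Here is a minimal sketch, assuming hypothetical project, dataset, and table names; one way to rename is to copy the table and then drop the original so Datastream can re-create it with the new schema:

```python
# Sketch: preserve the existing BigQuery table under a new name before
# re-adding the source table to the stream. All names below are examples.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")            # assumed project ID

source = "my-project.my_dataset.messages"                  # table Datastream writes to
archive = "my-project.my_dataset.messages_varchar250"      # copy kept for history

# Copy the current table to the archive name and wait for the job to finish.
copy_job = client.copy_table(source, archive)
copy_job.result()

# Drop the original; Datastream will re-create it with the updated schema
# once the table is added back to the stream.
client.delete_table(source)
```

After the table is re-added to the stream, trigger a backfill for it so the gap is filled, and reconcile against the archived copy if needed.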
Create a New Stream: Another option is to set up a new Datastream for the table that has undergone the schema change. This will result in a new BigQuery table with the updated schema, thereby avoiding data truncation issues. Subsequently, you can merge this data with your existing dataset if necessary.
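If you do need to consolidate the new stream's table with your existing data, a MERGE run through the BigQuery client is one way to do it. This is only a sketch; the project, dataset, table, and key column names are placeholders for your own schema:

```python
# Sketch: fold rows from the new stream's output table into a consolidated table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

merge_sql = """
MERGE `my-project.my_dataset.messages` AS target
USING `my-project.my_dataset.messages_new_stream` AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET message = source.message
WHEN NOT MATCHED THEN
  INSERT (id, message) VALUES (source.id, source.message)
"""

# Run the MERGE and block until it completes.
client.query(merge_sql).result()
```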
Use a Dataflow Job: For greater control and flexibility, especially with more complex schema changes, consider implementing a Dataflow job. This job can read data from the source PostgreSQL table, apply the necessary schema transformations (like increasing the varchar size), and then write the data to the BigQuery table. This approach is more complex but can be more effective for specific transformation needs.
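As a rough illustration of that kind of job, here is a minimal Apache Beam (Python SDK) sketch. The connection details, table names, and two-column schema are placeholders, and in practice you would run it on the Dataflow runner with your own pipeline options:

```python
# Sketch: re-read the Postgres table over JDBC and rewrite it to BigQuery,
# where the STRING column has no length cap. All names/credentials are examples.
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # e.g. --runner=DataflowRunner --project=... --region=...

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPostgres" >> ReadFromJdbc(
            table_name="public.messages",                      # source table (example)
            driver_class_name="org.postgresql.Driver",
            jdbc_url="jdbc:postgresql://10.0.0.5:5432/appdb",  # example connection string
            username="replica_reader",
            password="REPLACE_ME",
        )
        | "ToDict" >> beam.Map(lambda row: {"id": row.id, "message": row.message})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.messages",
            schema="id:INTEGER,message:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```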