
Schema changes for incremental tables in Dataform

Hello Everyone,

We started working with Dataform recently and have the following questions:

1. How do we handle schema changes for an incremental table? I.e. if a new column gets added to the source table after the target table already has incremental data, how do we handle this?

2. How can we pass values from Airflow to Dataform to make it more dynamic?

Thanks!


1. I am imagining a table that has 2 columns "col1" and "col2".  The table now has "col3" added to it.  What is it you want to have happen at this point?

2. Have you seen this documentation on Airflow and Dataform integration? (ref)  What is it you are looking to achieve that isn't described there?

Point 1:

What I am expecting: take your example, where the table starts with two columns and I am using Dataform with the incremental table type. How do I handle the addition of "col3"? Is there any way, via Dataform itself, to add or remove a column in the production environment without managing it externally?

Point 2:  

What's the best practice around managing physical tables in Dataform?

For Airflow, I have a pipeline and would like to make it more idempotent and use more Airflow features like backfilling. Is it possible to pass variables from Airflow to Dataform other than maintaining them directly in the release config? Something like the sketch below is what I have in mind.
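A rough sketch using the Dataform operators from the Google Airflow provider (apache-airflow-providers-google); the project, region, and repository IDs and the run_date variable name are all placeholders I made up:

```python
# Sketch only: passing per-run values from Airflow into a Dataform compilation.
# Project/region/repository IDs and the "run_date" variable are placeholders.
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.dataform import (
    DataformCreateCompilationResultOperator,
    DataformCreateWorkflowInvocationOperator,
)

with DAG(
    dag_id="dataform_backfill_demo",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=True,  # enables Airflow backfilling
) as dag:
    compile_result = DataformCreateCompilationResultOperator(
        task_id="compile",
        project_id="my-project",           # placeholder
        region="us-central1",              # placeholder
        repository_id="my-dataform-repo",  # placeholder
        compilation_result={
            "git_commitish": "main",
            "code_compilation_config": {
                # {{ ds }} is the run's logical date, so a backfilled run
                # compiles with its own date instead of "today".
                "vars": {"run_date": "{{ ds }}"},
            },
        },
    )

    invoke = DataformCreateWorkflowInvocationOperator(
        task_id="invoke",
        project_id="my-project",
        region="us-central1",
        repository_id="my-dataform-repo",
        workflow_invocation={
            "compilation_result": "{{ task_instance.xcom_pull('compile')['name'] }}",
        },
    )

    compile_result >> invoke
```

On the Dataform side the value would then be read as ${dataform.projectConfig.vars.run_date} inside the SQLX, so each backfilled run filters on its own logical date.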

This is a limitation of BigQuery, not Dataform: an INSERT cannot expand the target table's schema based on the inserted values or the SELECT statement.

  1. If your source/new data is not in BigQuery yet (e.g. it is in GCS), you can load it into the BQ target table through the REST API with a schema update option; the SQL LOAD DATA statement has no such option. If you are familiar with Cloud Functions, you can write a remote function that does this from within BQ/Dataform.
  2. If the source/new data is in BQ already, you can write a script that compares the schemas (through INFORMATION_SCHEMA or the table metadata) and executes the appropriate ALTER TABLE statements. Both options are sketched below.
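To make both options concrete, here is a rough Python sketch using the google-cloud-bigquery client (all project, dataset, table, and bucket names are placeholders):

```python
# Rough sketch of both workarounds; all project/dataset/table/bucket names
# are placeholders. Uses the google-cloud-bigquery client library.
from google.cloud import bigquery

client = bigquery.Client()
TARGET = "my-project.my_dataset.my_incremental_table"  # placeholder

# Option 1: new data sits in GCS. Load it through the API and let the load
# job widen the target schema (the SQL LOAD DATA statement cannot do this).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)
client.load_table_from_uri(
    "gs://my-bucket/increments/*.parquet",  # placeholder
    TARGET,
    job_config=job_config,
).result()

# Option 2: new data is already in BQ. Diff the two schemas and add the
# missing columns to the target before the incremental INSERT runs.
SOURCE = "my-project.my_dataset.my_source_table"  # placeholder
source_fields = {f.name: f for f in client.get_table(SOURCE).schema}
target_names = {f.name for f in client.get_table(TARGET).schema}
for name, field in source_fields.items():
    if name not in target_names:
        # Note: legacy type names (e.g. FLOAT) may need mapping to
        # GoogleSQL types (FLOAT64) for the DDL to be accepted.
        client.query(
            f"ALTER TABLE `{TARGET}` ADD COLUMN `{name}` {field.field_type}"
        ).result()
```

Either way, the schema fix has to run before the incremental INSERT executes.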

Thanks for the answer!

This would add an extra step to take care of schema changes, which are unavoidable in the long run. I was looking for something like the on_schema_change option dbt provides for incremental models, as mentioned here

@Sourabh17  I am with you. I have found this old feature request from one of the few maintainers left...

Strange that it did not get more traction, as schema evolution is indeed a common scenario.

For what it's worth, we have worked around it by exporting each "increment" as Parquet to GCS and creating a one-time external table over all of them. When querying that table, BigQuery reads the schema of the "latest" file (based on path, in alphanumeric order) and uses it to read all partitions. No incremental load necessary. A sketch of the setup is below.
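In case it is useful, this is roughly what that looks like with the Python client (bucket and table names are invented):

```python
# Sketch of the external-table workaround; bucket and table names are made up.
from google.cloud import bigquery

client = bigquery.Client()

# Each increment is exported as Parquet under its own path, e.g.
# gs://my-bucket/events/dt=2024-01-01/part-000.parquet
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/events/*.parquet"]  # placeholder
external_config.autodetect = True  # schema is inferred from the files themselves

table = bigquery.Table("my-project.my_dataset.events_external")  # placeholder
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```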