I am trying to re-execute an already created table via Dataform. I have added a new partition column in the config, and I am running with the Full refresh option checked. I am expecting it to drop the table and recreate it. However, the script generated by Dataform is:
BEGIN
CREATE SCHEMA IF NOT EXISTS `dev-1.d_us_dev` OPTIONS(location="US");
EXCEPTION WHEN ERROR THEN
IF NOT CONTAINS_SUBSTR(@@error.message, "already exists: dataset") AND
NOT CONTAINS_SUBSTR(@@error.message, "too many dataset metadata update operations") AND
NOT CONTAINS_SUBSTR(@@error.message, "User does not have bigquery.datasets.create permission")
THEN
RAISE USING MESSAGE = @@error.message;
END IF;
END;
BEGIN
DECLARE dataform_table_type DEFAULT (
SELECT ANY_VALUE(table_type)
FROM `dev-1.d_us_dev.INFORMATION_SCHEMA.TABLES`
WHERE table_name = 't_union'
); --output is BASE TABLE
IF dataform_table_type IS NOT NULL AND dataform_table_type != 'BASE TABLE' THEN
IF dataform_table_type = 'BASE TABLE' THEN
DROP TABLE IF EXISTS `dev-1.d_us_dev.t_union`;
ELSEIF dataform_table_type = 'VIEW' THEN
DROP VIEW IF EXISTS `dev-1.d_us_dev.t_union`;
ELSEIF dataform_table_type = 'MATERIALIZED VIEW' THEN
DROP MATERIALIZED VIEW IF EXISTS `dev-1.d_us_dev.t_union`;
END IF;
END IF;
BEGIN
CREATE OR REPLACE TABLE `dev-1.d_us_dev.t_union`
PARTITION BY DATETIME_TRUNC(entry_date_utc, HOUR)
CLUSTER BY reference_location
OPTIONS()
AS (
--query text here
);
END;
END;
The output is an error:
Invalid value: Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is interval(type:hour,field:entry_date_utc) clustering(reference_location) and existing spec is none at [31:3]
So is it a bug that it is not dropping the table, or is there a reason for this?
Dataform, working on top of BigQuery, does not permit directly altering an existing table's partitioning scheme with a CREATE OR REPLACE TABLE
statement. This restriction exists because repartitioning requires a fundamental reorganization of the table's physical data storage, so the table must be completely rebuilt: dropped and then recreated.
Solution
To introduce a new partition column and apply a new partitioning scheme, follow the outlined steps:
Automate Table Recreation with Dataform:
Modify Your Dataform Script: Incorporate a pre-operation script in your Dataform project to ensure the existing table is dropped if it already exists. This step is crucial for preparing the dataset for the new partitioning configuration.
-- Pre-operation script to drop the table if it exists
DROP TABLE IF EXISTS `dev-1.d_us_dev.t_union`;
Configure Full Refresh: Adjust your table's configuration in Dataform to enable the "Full Refresh" option. This setting instructs Dataform to disregard the existing table structure, allowing for the recreation of the table with the updated partitioning scheme upon the next workflow execution.
Execute Your Dataform Workflow: With the "Full Refresh" option activated and the pre-operation script in place, proceed to run your Dataform workflow. This process will lead to the dropping of the current table and its subsequent recreation with the desired partitioning setup.
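The steps above can be sketched as a single SQLX definition. The column names and config values below are taken from the question; the exact config keys shown (schema, partitionBy, clusterBy) are standard Dataform options, but treat this as an illustrative sketch rather than your exact file:

```sql
config {
  type: "table",
  schema: "d_us_dev",
  name: "t_union",
  bigquery: {
    partitionBy: "DATETIME_TRUNC(entry_date_utc, HOUR)",
    clusterBy: ["reference_location"]
  }
}

pre_operations {
  -- Drop the old table so the new partitioning spec can be applied
  DROP TABLE IF EXISTS ${self()}
}

--query text here
```

Note that for `type: "table"` every run rebuilds the table anyway, so an unconditional DROP in pre_operations is safe; for incremental tables you would guard it (see below).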
Important Considerations
Data Backup: Prioritize backing up your data before executing operations that involve dropping tables. Utilize BigQuery's table snapshot feature for a straightforward and reliable data restoration method, if necessary.
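As a concrete example of the snapshot approach, something along these lines can be run before the destructive rebuild (the backup table name and the 7-day expiration are assumptions for illustration):

```sql
-- Back up the table before dropping it; the snapshot expires automatically
CREATE SNAPSHOT TABLE `dev-1.d_us_dev.t_union_backup`
CLONE `dev-1.d_us_dev.t_union`
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);
```

If the rebuild goes wrong, the table can be restored from the snapshot with a CREATE TABLE ... CLONE statement.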
Understanding Partitioning: Enhance your table optimization efforts by deepening your understanding of BigQuery's partitioning capabilities. Detailed information and best practices can be found in BigQuery's documentation on partitioned tables: BigQuery Partitioned Tables.
Leveraging Dataform's Features: This methodology takes full advantage of Dataform's capabilities, including version control, code review, and automated scheduling. These features ensure that your data transformation workflows are not only reproducible but also meticulously documented.
Thanks MS for your reply. As you say in your point number 2, I was expecting it to "disregard the existing table structure, allowing for the recreation of the table with the updated partitioning scheme upon the next workflow execution".
I have now added a pre_operation:
pre_operations {
  IF ${incremental()} = false THEN
    EXECUTE IMMEDIATE 'DROP TABLE IF EXISTS ${self()}';
  END IF;
}
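For what it's worth, an equivalent and arguably more idiomatic form uses Dataform's built-in `when()` helper, which emits the DROP statement at compile time only when the run is not incremental (i.e. when Full refresh is checked), instead of evaluating an IF at query time:

```sql
pre_operations {
  -- Emit the DROP only on full-refresh (non-incremental) runs
  ${when(!incremental(), `DROP TABLE IF EXISTS ${self()}`)}
}
```

Both forms should behave the same here; the `when()` version just keeps the compiled SQL free of scripting when the guard is false.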