I am trying to re-execute an already created table via Dataform. I have added a new partition column in the config, and I am running with the Full refresh option checked. I am expecting it to drop the table and recreate it. However, the script generated by Dataform is:
BEGIN
CREATE SCHEMA IF NOT EXISTS `dev-1.d_us_dev` OPTIONS(location="US");
EXCEPTION WHEN ERROR THEN
IF NOT CONTAINS_SUBSTR(@@error.message, "already exists: dataset") AND
NOT CONTAINS_SUBSTR(@@error.message, "too many dataset metadata update operations") AND
NOT CONTAINS_SUBSTR(@@error.message, "User does not have bigquery.datasets.create permission")
THEN
RAISE USING MESSAGE = @@error.message;
END IF;
END;
BEGIN
DECLARE dataform_table_type DEFAULT (
SELECT ANY_VALUE(table_type)
FROM `dev-1.d_us_dev.INFORMATION_SCHEMA.TABLES`
WHERE table_name = 't_union'
); --output is BASE TABLE
IF dataform_table_type IS NOT NULL AND dataform_table_type != 'BASE TABLE' THEN
IF dataform_table_type = 'BASE TABLE' THEN
DROP TABLE IF EXISTS `dev-1.d_us_dev.t_union`;
ELSEIF dataform_table_type = 'VIEW' THEN
DROP VIEW IF EXISTS `dev-1.d_us_dev.t_union`;
ELSEIF dataform_table_type = 'MATERIALIZED VIEW' THEN
DROP MATERIALIZED VIEW IF EXISTS `dev-1.d_us_dev.t_union`;
END IF;
END IF;
BEGIN
CREATE OR REPLACE TABLE `dev-1.d_us_dev.t_union`
PARTITION BY DATETIME_TRUNC(entry_date_utc, HOUR)
CLUSTER BY reference_location
OPTIONS()
AS (
--query text here
);
END;
END;
The output is an error:
Invalid value: Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is interval(type:hour,field:entry_date_utc) clustering(reference_location) and existing spec is none at [31:3]
So is it a bug that it is not dropping the table, or is there a reason for this?
Dataform, working on top of BigQuery, does not permit directly altering an existing table's partitioning scheme with a CREATE OR REPLACE TABLE
statement. This restriction exists because repartitioning requires a fundamental reorganization of the table's physical data storage, so the table must be completely rebuilt: dropped and then recreated.
Solution
To introduce a new partition column and apply a new partitioning scheme, follow the outlined steps:
Automate Table Recreation with Dataform:
Modify Your Dataform Script: Incorporate a pre-operation script in your Dataform project to ensure the existing table is dropped if it already exists. This step is crucial for preparing the dataset for the new partitioning configuration.
-- Pre-operation script to drop the table if it exists
DROP TABLE IF EXISTS `dev-1.d_us_dev.t_union`;
Configure Full Refresh: Adjust your table's configuration in Dataform to enable the "Full Refresh" option. This setting instructs Dataform to disregard the existing table structure, allowing for the recreation of the table with the updated partitioning scheme upon the next workflow execution.
Execute Your Dataform Workflow: With the "Full Refresh" option activated and the pre-operation script in place, proceed to run your Dataform workflow. This process will lead to the dropping of the current table and its subsequent recreation with the desired partitioning setup.
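The steps above can be sketched as a single SQLX definition. The column names and config values below are taken from the question; the exact config keys shown (schema, partitionBy, clusterBy) are standard Dataform options, but treat this as an illustrative sketch rather than your exact file:

```sql
config {
  type: "table",
  schema: "d_us_dev",
  name: "t_union",
  bigquery: {
    partitionBy: "DATETIME_TRUNC(entry_date_utc, HOUR)",
    clusterBy: ["reference_location"]
  }
}

pre_operations {
  -- Drop the old table so the new partitioning spec can be applied
  DROP TABLE IF EXISTS ${self()}
}

--query text here
```

Note that for `type: "table"` every run rebuilds the table anyway, so an unconditional DROP in pre_operations is safe; for incremental tables you would guard it (see below).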
Important Considerations
Data Backup: Prioritize backing up your data before executing operations that involve dropping tables. Utilize BigQuery's table snapshot feature for a straightforward and reliable data restoration method, if necessary.
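As a concrete example of the snapshot approach, something along these lines can be run before the destructive rebuild (the backup table name and the 7-day expiration are assumptions for illustration):

```sql
-- Back up the table before dropping it; the snapshot expires automatically
CREATE SNAPSHOT TABLE `dev-1.d_us_dev.t_union_backup`
CLONE `dev-1.d_us_dev.t_union`
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);
```

If the rebuild goes wrong, the table can be restored from the snapshot with a CREATE TABLE ... CLONE statement.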
Understanding Partitioning: Enhance your table optimization efforts by deepening your understanding of BigQuery's partitioning capabilities. Detailed information and best practices can be found in BigQuery's documentation on partitioned tables: BigQuery Partitioned Tables.
Leveraging Dataform's Features: This methodology takes full advantage of Dataform's capabilities, including version control, code review, and automated scheduling. These features ensure that your data transformation workflows are not only reproducible but also meticulously documented.
Thanks MS for your reply. As you say in your point number 2, I was expecting it to "disregard the existing table structure, allowing for the recreation of the table with the updated partitioning scheme upon the next workflow execution".
I have now added a pre_operation:
pre_operations {
  IF ${incremental()} = false THEN
    EXECUTE IMMEDIATE 'DROP TABLE IF EXISTS ${self()}';
  END IF;
}
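For what it's worth, an equivalent and arguably more idiomatic form uses Dataform's built-in `when()` helper, which emits the DROP statement at compile time only when the run is not incremental (i.e. when Full refresh is checked), instead of evaluating an IF at query time:

```sql
pre_operations {
  -- Emit the DROP only on full-refresh (non-incremental) runs
  ${when(!incremental(), `DROP TABLE IF EXISTS ${self()}`)}
}
```

Both forms should behave the same here; the `when()` version just keeps the compiled SQL free of scripting when the guard is false.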