
Dataform Incremental Table: Execute Immediate Approach

How come BigQuery is now using a stored procedure and EXECUTE IMMEDIATE in my incremental table scripts? There seems to be a new approach that uses EXECUTE IMMEDIATE and specifies column names on insert, but I can't find any release notes or other documentation on it. How / where can I find the source code or release notes for this change?

Prior to 19th September the boilerplate looked like this (schema/table names stripped out):

BEGIN
  CREATE SCHEMA IF NOT EXISTS `table-ref` OPTIONS(location="australia-southeast1");
EXCEPTION WHEN ERROR THEN
  IF NOT CONTAINS_SUBSTR(@@error.message, "already exists: dataset") AND
    NOT CONTAINS_SUBSTR(@@error.message, "too many dataset metadata update operations") AND
    NOT CONTAINS_SUBSTR(@@error.message, "User does not have bigquery.datasets.create permission")
  THEN
    RAISE USING MESSAGE = @@error.message;
  END IF;
END;
    BEGIN
      DECLARE dataform_table_type DEFAULT (
  SELECT ANY_VALUE(table_type)
  FROM `schema-ref.INFORMATION_SCHEMA.TABLES`
  WHERE table_name = 'table-name'
);
      IF dataform_table_type IS NOT NULL AND dataform_table_type != 'BASE TABLE' THEN
  IF dataform_table_type = 'BASE TABLE' THEN
    DROP TABLE IF EXISTS `table-ref`;
  ELSEIF dataform_table_type = "VIEW" THEN
    DROP VIEW IF EXISTS `table-ref`;
  ELSEIF dataform_table_type = 'MATERIALIZED VIEW' THEN
    DROP MATERIALIZED VIEW IF EXISTS `table-ref`;
  END IF;
END IF;
      IF dataform_table_type IS NOT NULL THEN
        BEGIN
          
          INSERT INTO `table-ref`
          (
            
Now, on September 20th, it looks like this:

BEGIN
  CREATE SCHEMA IF NOT EXISTS `schema-ref` OPTIONS(location="australia-southeast1");
EXCEPTION WHEN ERROR THEN
  IF NOT CONTAINS_SUBSTR(@@error.message, "already exists: dataset") AND
    NOT CONTAINS_SUBSTR(@@error.message, "too many dataset metadata update operations") AND
    NOT CONTAINS_SUBSTR(@@error.message, "User does not have bigquery.datasets.create permission")
  THEN
    RAISE USING MESSAGE = @@error.message;
  END IF;
END;
    BEGIN
      DECLARE dataform_table_type DEFAULT (
  SELECT ANY_VALUE(table_type)
  FROM `schema-ref.INFORMATION_SCHEMA.TABLES`
  WHERE table_name = 'table-ref'
);
      IF dataform_table_type IS NOT NULL AND dataform_table_type != 'BASE TABLE' THEN
  IF dataform_table_type = 'BASE TABLE' THEN
    DROP TABLE IF EXISTS `table-ref`;
  ELSEIF dataform_table_type = "VIEW" THEN
    DROP VIEW IF EXISTS `table-ref`;
  ELSEIF dataform_table_type = 'MATERIALIZED VIEW' THEN
    DROP MATERIALIZED VIEW IF EXISTS `table-ref`;
  END IF;
END IF;
      IF dataform_table_type IS NOT NULL THEN
        BEGIN
          DECLARE dataform_columns ARRAY<STRING>;
          DECLARE dataform_columns_list STRING;
          SET dataform_columns = (
            SELECT
              ARRAY_AGG(DISTINCT "`" || column_name || "`")
            FROM `schema-ref.INFORMATION_SCHEMA.COLUMNS`
            WHERE table_name = 'table-ref'
          );
          SET dataform_columns_list = (
            SELECT
              STRING_AGG(column)
            FROM UNNEST(dataform_columns) AS column);

          EXECUTE IMMEDIATE
          """
            CREATE OR REPLACE PROCEDURE `procedure_ref`()
            BEGIN
              

              INSERT INTO `table-ref`
              (""" || dataform_columns_list || """)
              SELECT """ || dataform_columns_list || """
              FROM (

5 REPLIES

Dataform has recently introduced a new approach to inserting data into BigQuery incremental tables. This new approach uses a stored procedure and EXECUTE IMMEDIATE to specify the column names that data is inserted into.

The specific changes that were made to the boilerplate code on September 20th include:

  • The code now creates a stored procedure to insert data into the incremental table.
  • The code now uses EXECUTE IMMEDIATE to build that procedure dynamically, so the column names being inserted into can be listed explicitly.
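
Condensed, the generated pattern looks roughly like the sketch below. This is only an illustration of the technique, not the exact boilerplate Dataform emits; names such as my_dataset, my_table, my_source and df_insert are placeholders.

DECLARE columns_list STRING;

-- Collect the target table's column names from INFORMATION_SCHEMA.
SET columns_list = (
  SELECT STRING_AGG(DISTINCT '`' || column_name || '`')
  FROM my_dataset.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'my_table'
);

-- Build a procedure whose INSERT names every column explicitly, then run it.
EXECUTE IMMEDIATE """
  CREATE OR REPLACE PROCEDURE my_dataset.df_insert()
  BEGIN
    INSERT INTO my_dataset.my_table (""" || columns_list || """)
    SELECT """ || columns_list || """
    FROM my_dataset.my_source;
  END
""";
CALL my_dataset.df_insert();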

The new approach has a number of potential advantages over the previous one:

  • It can be more efficient, because it avoids the overhead of staging data in a temporary table.
  • It is safer, because the explicit column list prevents data from being accidentally inserted into the wrong columns (see the example after this list).
  • It is more flexible, because it gives control over the order in which columns are inserted.
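
To illustrate the "wrong columns" point: an INSERT without a column list maps values by position, so it can silently misassign data whenever the target table's column order differs from the SELECT. Naming the columns makes the mapping explicit. The table and column names below are hypothetical.

-- Positional: relies on the target's column order matching the SELECT.
INSERT INTO my_dataset.my_table
SELECT user_id, event_name, event_ts FROM my_dataset.staging;

-- Explicit: each value maps to a named column, regardless of column order.
INSERT INTO my_dataset.my_table (user_id, event_name, event_ts)
SELECT user_id, event_name, event_ts FROM my_dataset.staging;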

However, it is important to note that the new approach is still evolving and may contain bugs. It is also not yet documented in the official Google Cloud release notes.

Example of a Stored Procedure Using the New Approach

The following example shows a stored procedure that uses the new approach to insert data into a table:

CREATE OR REPLACE PROCEDURE my_dataset.insert_into_table(table_name STRING)
BEGIN
  -- Build the INSERT dynamically by concatenating the table name into the SQL.
  EXECUTE IMMEDIATE """
    INSERT INTO `""" || table_name || """`
    (column_1, column_2, column_3)
    SELECT value_1, value_2, value_3
    FROM my_source_table;
  """;
END;

-- Procedures live in a dataset (here my_dataset) and are invoked with CALL.
CALL my_dataset.insert_into_table('my_table_name');

Thanks MS446 - I do agree the new approach has many advantages, so I was happy to see the implementation. What I am trying to understand, given that you mention this approach is still under development and that "it is important to be aware of the potential risks before using the new approach, and to test it thoroughly before using it in production", is how I would configure Dataform to run on the boilerplate from before the change on the 20th. Thanks

I understand your concerns. Given the update to the boilerplate code in Dataform, it's not immediately clear how to revert to the previous version used before the change on the 20th.

One option is to fork the open-source Dataform repository and use a release that contains the old boilerplate code - for example, by pinning the @dataform/core version in your project's package.json. However, this approach might require significant maintenance effort, especially if you want to incorporate future updates from Dataform.

Another option is to manually adjust your scripts to use the old boilerplate (a rough sketch of what that could look like follows below). While this might be time-consuming, it ensures you're using a version you're comfortable with. You could also seek guidance from the Dataform community or support channels.

Lastly, you could wait for Dataform to release an official update or provide an option to use the old boilerplate code. This approach is conservative and ensures you're using officially supported configurations.
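
For the manual route, one possible sketch is to replace the incremental action with a type: "operations" action and hand-write the append logic yourself. This assumes Dataform's standard ${self()} and ${ref()} helpers and the --- separator between statements in operations files; source_table and event_ts are placeholders.

config {
  type: "operations",
  hasOutput: true
}

-- Create the table on the first run only (empty, with the source's schema).
CREATE TABLE IF NOT EXISTS ${self()} AS
SELECT * FROM ${ref("source_table")} WHERE FALSE;

---

-- Append only rows newer than what the table already holds.
INSERT INTO ${self()}
SELECT *
FROM ${ref("source_table")}
WHERE event_ts > (SELECT IFNULL(MAX(event_ts), TIMESTAMP "1970-01-01") FROM ${self()});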

Regarding the new approach:

  • Bugs: As it's still under development, there might be undiscovered bugs.
  • Performance: There's a possibility that the new approach might not be as performant as the old one, especially with large datasets.
  • Compatibility: Ensure the new approach is compatible with any third-party tools or integrations you might be using with Dataform.

If you're contemplating adopting the new approach, it's crucial to assess the potential risks and benefits. Consulting with someone knowledgeable about Dataform might also provide valuable insights.

Is this boilerplate change location-specific?

My incremental tables are still running with the "classic" configuration.

The boilerplate code change you mentioned for Dataform's incremental tables is not inherently location-specific. However, there are a few possible reasons why you might be observing different behavior:

  1. Deployment Regions: If you have multiple instances of Dataform deployed in different regions, it's possible that updates or changes to the platform are being rolled out in a staggered manner, with some regions receiving updates before others.

  2. Feature Flags: Software platforms often use feature flags to roll out new features or changes to a subset of users. This allows them to test new features in a live environment with a smaller group before rolling it out to everyone. It's possible that the new boilerplate code is behind a feature flag, and not all accounts or projects have it enabled yet.

  3. Versioning: If you're using a version-controlled instance of Dataform or have pinned your project to a specific version of the platform, you might not see updates until you decide to upgrade or switch to the latest version.

  4. Custom Configurations: If you have custom configurations or scripts in your Dataform setup, they might override or alter the default behavior of the platform, including the boilerplate code.

  5. Caching: Sometimes, caching mechanisms can cause users to see outdated versions of software or scripts. Clearing caches or doing a hard refresh might help, though this is less likely with a platform like Dataform.

If you want to use the "classic" configuration, you might need to manually adjust your scripts or configurations, as mentioned earlier. If you're unsure about the changes or how to manage them, reaching out to Dataform's support or community forums would be a good step. They can provide specific guidance and insights related to the platform's updates and behaviors.