Solved: Re: Use case...when...then...end statement in stor...

sysph · 03-15-2024 07:35 AM

Hi there,

hope you can help me with this. I'm trying to use stored procedure to create flatten query and data.

I have this query to fetch all events and event parameters

with events as (
  SELECT
    event_name,
    param.key as name,
    CASE
      WHEN param.value.string_value is not null THEN "string"
      WHEN param.value.int_value is not null THEN "int"
      WHEN param.value.double_value is not null THEN "double"
      WHEN param.value.float_value is not null THEN "float"
    END as value
   FROM `xx.analytics_xxxx.events_*`,
  UNNEST(event_params) as param
  WHERE _TABLE_SUFFIX between '20240315' and '20240315'
  --DATE_sub(current_date(), interval 7 day) and DATE_sub(current_date(), interval 1 day)
  GROUP BY event_name, name, value
)
 
 
select event_name, dimension.name, dimension.value from (
SELECT
  event_name,
  ARRAY_AGG(struct(name, value) order by name) as attribute
FROM events
GROUP BY event_name
ORDER BY event_name
), unnest(attribute) as dimension

And i also have a flatten query

WITH analytics AS (
SELECT
    event_name
    , event_date
    , event_timestamp
    , TIMESTAMP_MICROS(event_timestamp) as event_datetime
    , event_previous_timestamp
    , event_value_in_usd
    , event_bundle_sequence_id
    , event_server_timestamp_offset
    , user_id
    , user_pseudo_id
    ------------------------------- Event Parameters ------------------------------
    , (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'Appearance') AS Appearance
    , (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'CollectionSet') AS CollectionSet
    , (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'Connectivity') AS Connectivity
...
FROM `ga-xxx.analytics_xxx.events_*`
    --, UNNEST(items) AS items
    WHERE _TABLE_SUFFIX BETWEEN '20240315' AND '20240315' 

)
, analytics_updated AS (
    SELECT
    *
    --CONCAT(ga_session_id,".",user_pseudo_id) AS unique_session_id
    --, *
    FROM analytics
)
SELECT * FROM analytics_updated

I'm trying to combine these 2 together. I use for...end for; to loop through the parameters in order to auto generate the select statement in the query. Here's what i have the moment, however, i'm stucked at the case ...when...then...end statement. How do i join the value into the query string (set value_type_statement = ...) into flatten_statement var? Thanks in advance.

create or replace procedure `ga-xx.sandbox_xx`.proc_app ()
begin
declare value_type_statement string;
declare parameters_statement string;
declare flatten_statement string; 


set flatten_statement = "select "
    || "event_name"
    || ", event_date"
    || ", event_timestamp"
    || ", TIMESTAMP_MICROS(event_timestamp) as event_datetime"
    || ", event_previous_timestamp"
    || ", event_value_in_usd"
    || ", event_bundle_sequence_id"
    || ", event_server_timestamp_offset"
    || ", user_id"
    || ", user_pseudo_id" ;

for iterator in (with events as (
  SELECT
    event_name,
    param.key as name,
    CASE
      WHEN param.value.string_value is not null THEN "string"
      WHEN param.value.int_value is not null THEN "int"
      WHEN param.value.double_value is not null THEN "double"
      WHEN param.value.float_value is not null THEN "float"
    END as value
   FROM `ga-xx.analytics_xxx.events_*`,
  UNNEST(event_params) as param
  WHERE _TABLE_SUFFIX between '20240315' and '20240315'
  --DATE_sub(current_date(), interval 7 day) and DATE_sub(current_date(), interval 1 day)
  GROUP BY event_name, name, value
)
select event_name, dimension.name, dimension.value from (
SELECT
  event_name,
  ARRAY_AGG(struct(name, value) order by name) as attribute
FROM events
GROUP BY event_name
ORDER BY event_name
), unnest(attribute) as dimension)
do

  -- IF( iterator.value== "string" , 'true', 'false') AS result
    -- CASE
    --   WHEN "string" THEN "(SELECT value.string_value "
    --   WHEN "int" THEN "(SELECT value.int_value "
    --   WHEN "double" THEN "(SELECT value.double_value "
    --   WHEN "float" THEN "(SELECT value.float_value "
    -- END as value
  IF (iterator.value = "string") THEN set value_type_statement = ' (SELECT value.string_value '; end if;
  IF (iterator.value = "int") THEN set value_type_statement = ' (SELECT value.int_value '; end if; 
  IF (iterator.value = "double") THEN set value_type_statement = ' (SELECT value.double_value '; end if;
  IF (iterator.value = "float") THEN set value_type_statement = ' (SELECT value.float_value '; end if;

  set flatten_statement = flatten_statement 
    || value_type_statement 
    || " FROM UNNEST(event_params) WHERE key = " || iterator.name || "AS" || iterator.name;

end for;

set flatten_statement = flatten_statement;
end;

--call `ga-xx.sandbox_xx`.proc_app ();

ms4446

To build a query string dynamically within a BigQuery stored procedure, especially when you need to handle different data types with conditional logic, you will have to piece together your SQL string by checking each condition and appending the appropriate part of the SQL based on the outcome. While you were thinking of using a CASE statement within a loop to handle these conditions, BigQuery's scripting doesn't support directly assigning a variable's value with a CASE statement outside a SELECT query. A more straightforward approach is to use IF statements or to integrate the CASE statement within the construction of your query string itself.

Here's a revised approach to build your flatten_statement string within the loop, using IF statements for clarity and simplicity:

CREATE OR REPLACE PROCEDURE `ga-xxx.sandbox_x`.proc_app() 
BEGIN 
  DECLARE value_type_statement STRING DEFAULT ''; 
  DECLARE flatten_statement STRING DEFAULT 'SELECT 
                                               event_name, 
                                               event_date, 
                                               event_timestamp, 
                                               TIMESTAMP_MICROS(event_timestamp) AS event_datetime, 
                                               event_previous_timestamp, 
                                               event_value_in_usd, 
                                               event_bundle_sequence_id, 
                                               event_server_timestamp_offset, 
                                               user_id, 
                                               user_pseudo_id'; 

  -- Assuming you want to append to the flatten_statement for each parameter 
  FOR iterator IN ( 
       WITH events AS ( 
            SELECT 
               event_name, 
               param.key AS name, 
               CASE 
                 WHEN param.value.string_value IS NOT NULL THEN 'string' 
                 WHEN param.value.int_value IS NOT NULL THEN 'int' 
                 WHEN param.value.double_value IS NOT NULL THEN 'double' 
                 WHEN param.value.float_value IS NOT NULL THEN 'float' 
                 END AS value 
            FROM `ga-xx.analytics_xxx.events_*`, 
                 UNNEST(event_params) AS param 
            WHERE _TABLE_SUFFIX BETWEEN '20240315' AND '20240315' 
            GROUP BY event_name, name, value 
       ) 
       SELECT event_name, dimension.name, dimension.value 
       FROM ( SELECT event_name, ARRAY_AGG(STRUCT(name, value) ORDER BY name) AS attribute 
              FROM events 
              GROUP BY event_name 
              ORDER BY event_name ), 
            UNNEST(attribute) AS dimension 
  ) DO 
    SET value_type_statement = CASE iterator.value 
                                 WHEN 'string' THEN 'STRING' 
                                 WHEN 'int' THEN 'INT64' 
                                 WHEN 'double' THEN 'FLOAT64' 
                                 WHEN 'float' THEN 'FLOAT64' 
                                 ELSE 'STRING' -- Default case if needed 
                               END; 

    -- Append to the flatten_statement for each parameter 
    SET flatten_statement = CONCAT(flatten_statement, 
                                   ', (SELECT value.', value_type_statement, '_value 
                                       FROM UNNEST(event_params) 
                                       WHERE key = ''', iterator.name, ''') AS ', iterator.name); 
  END FOR; 

  -- Debug: Output the constructed query (remove in production) 
  SELECT flatten_statement; 
END; 

 

View solution in original post

ms4446

Here are several strategies to enhance performance and maintainability:

1. Minimize Dynamic SQL Construction Complexity

Dynamic SQL creation can be computationally expensive. Simplify the logic to reduce the complexity and number of iterations within your procedure.

2. Use Pre-aggregated Metadata

Create a table storing metadata about event parameters and user properties (data types, extraction logic).
This pre-computed data makes dynamic SQL generation simpler and faster.

3. Optimize Iteration Logic

Group similar operations to minimize iterations.
Handle cases with shared SQL generation logic together.

4. Leverage BigQuery Functions

BigQuery's built-in functions often simplify complex logic.
Consider using functions flexible with multiple data types rather than dynamic type-based cases.

5. Streamline SQL String Concatenation

Use efficient string operations or minimize the need for concatenation entirely.

Without altering the core logic of your procedure significantly, here are a few adjustments for potential optimizations:

 

-- Example of a simplified approach (conceptual, adjust as needed): 
CREATE OR REPLACE PROCEDURE `ga-xx.sandbox_xx`.proc_app(startDate STRING, endDate STRING, OUT sql STRING) 
BEGIN 
  -- Base SQL statement 
  DECLARE base_sql STRING DEFAULT "WITH analytics AS (SELECT event_name, ... FROM `ga-xx.analytics_xx.events_intraday_*` WHERE _TABLE_SUFFIX BETWEEN '" || startDate || "' AND '" || endDate || "') SELECT * FROM analytics;"; 

 -- Pre-aggregated metadata tables (param_meta, user_prop_meta)
 DECLARE params_sql ARRAY<STRING>; 
 DECLARE user_props_sql ARRAY<STRING>; 

 -- Construct SQL fragments (optimized)
 FOR param IN (SELECT extract_query FROM `ga-xx.sandbox_xx.param_meta`) DO 
   SET params_sql = ARRAY_CONCAT(params_sql, [param.extract_query]); 
 END FOR; 

 -- ... (similarly for user_props_sql)

 -- Combine into final SQL (more efficient concatenation)
 SET sql = CONCAT(base_sql, STRING_AGG(params_sql, ', '), STRING_AGG(user_props_sql, ', ')); 
END; 

 

View solution in original post

ms4446

Hi @sysph ,

The error you're encountering occurs because BigQuery doesn't support directly inserting variables like full_table_name into the FROM clause of a standard SQL query. This is a security and optimization feature.

To work around this, you need to:

Construct your SQL query as a string. Include placeholders for the dynamic parts (like the table name).
Execute the constructed string using EXECUTE IMMEDIATE.

 

CREATE OR REPLACE PROCEDURE `ga-xx.sandbox_xx`.proc_app ( 
    project_id STRING, 
    dataset_name STRING, 
    table_name STRING, 
    startDate STRING, 
    endDate STRING, 
    OUT sql STRING 
) 
BEGIN 
    -- Declare variables 
    DECLARE full_table_name STRING; 
    DECLARE query_statement STRING; 

    -- Construct the full table name 
    SET full_table_name = CONCAT('`', project_id, '.', dataset_name, '.', table_name, '`'); 

    -- Build the dynamic SQL statement
    SET query_statement = CONCAT(
        "WITH analytics AS (",
            "SELECT event_name, event_date, event_timestamp, ", 
            "       TIMESTAMP_MICROS(event_timestamp) as event_datetime, ",
            "       event_previous_timestamp, event_value_in_usd, event_bundle_sequence_id, ",
            "       event_server_timestamp_offset, user_id, user_pseudo_id ",
            "FROM ", full_table_name, " ",
            "WHERE _TABLE_SUFFIX BETWEEN '", startDate, "' AND '", endDate, "'" 
        ")" 
    ); 

    -- Add any further dynamic parts to the query as needed

    -- Finalize the SQL statement 
    SET query_statement = CONCAT(query_statement, " SELECT * FROM analytics;"); 

    -- Return the constructed SQL statement 
    SET sql = query_statement; 
END; 

How to Execute:

 

DECLARE sql STRING; 
CALL `ga-xx.sandbox_xx`.proc_app(
     'ga-xx', 
     'analytics_zz', 
     'events_intraday_*', 
     '20240319', 
     '20240319', 
     sql
); 

EXECUTE IMMEDIATE sql; 

 

View solution in original post

ms4446

Hi @sysph ,

You've encountered a common challenge: not all fields you might expect from the GA4 schema are present in every dataset. This discrepancy can occur due to several reasons:

Custom Events and Parameters: GA4 allows for custom events and parameters, which means not all datasets will have the same schema. Some fields might be present in one dataset but absent in another due to the custom nature of event tracking set up by different users.
Schema Evolution: The schema for GA4 data in BigQuery can evolve over time as new features are added to GA4 and as Google modifies the data export format. This can lead to situations where newer datasets have fields that older datasets do not.
Export Configurations: The configuration of the GA4 property and its data export to BigQuery might affect which fields are available. Some fields might be omitted based on user settings or the specific events being tracked.

To handle these discrepancies and efficiently retrieve available fields, consider the following strategies:

Dynamic Schema Discovery: Your current approach of using INFORMATION_SCHEMA.COLUMN_FIELD_PATHS for dynamic schema discovery is on the right track. This allows you to adapt to the available schema dynamically.
Graceful Handling of Missing Fields: When constructing your query, it might be beneficial to check if a field exists in the dataset before attempting to include it in your query. This can be challenging to do directly in SQL but consider preprocessing steps that might filter out non-existent fields based on your dynamic schema discovery.
Fallback Mechanisms: For critical fields that might be missing, consider implementing fallback mechanisms where possible. For example, if a specific user property is not available, you might be able to infer or approximate its value from other available data.
Schema Versioning: Keep track of schema changes over time. If you're working with historical data, understanding when certain fields were introduced or deprecated can help you adjust your queries accordingly.

View solution in original post

ms4446

To build a query string dynamically within a BigQuery stored procedure, especially when you need to handle different data types with conditional logic, you will have to piece together your SQL string by checking each condition and appending the appropriate part of the SQL based on the outcome. While you were thinking of using a CASE statement within a loop to handle these conditions, BigQuery's scripting doesn't support directly assigning a variable's value with a CASE statement outside a SELECT query. A more straightforward approach is to use IF statements or to integrate the CASE statement within the construction of your query string itself.

Here's a revised approach to build your flatten_statement string within the loop, using IF statements for clarity and simplicity:

CREATE OR REPLACE PROCEDURE `ga-xxx.sandbox_x`.proc_app() 
BEGIN 
  DECLARE value_type_statement STRING DEFAULT ''; 
  DECLARE flatten_statement STRING DEFAULT 'SELECT 
                                               event_name, 
                                               event_date, 
                                               event_timestamp, 
                                               TIMESTAMP_MICROS(event_timestamp) AS event_datetime, 
                                               event_previous_timestamp, 
                                               event_value_in_usd, 
                                               event_bundle_sequence_id, 
                                               event_server_timestamp_offset, 
                                               user_id, 
                                               user_pseudo_id'; 

  -- Assuming you want to append to the flatten_statement for each parameter 
  FOR iterator IN ( 
       WITH events AS ( 
            SELECT 
               event_name, 
               param.key AS name, 
               CASE 
                 WHEN param.value.string_value IS NOT NULL THEN 'string' 
                 WHEN param.value.int_value IS NOT NULL THEN 'int' 
                 WHEN param.value.double_value IS NOT NULL THEN 'double' 
                 WHEN param.value.float_value IS NOT NULL THEN 'float' 
                 END AS value 
            FROM `ga-xx.analytics_xxx.events_*`, 
                 UNNEST(event_params) AS param 
            WHERE _TABLE_SUFFIX BETWEEN '20240315' AND '20240315' 
            GROUP BY event_name, name, value 
       ) 
       SELECT event_name, dimension.name, dimension.value 
       FROM ( SELECT event_name, ARRAY_AGG(STRUCT(name, value) ORDER BY name) AS attribute 
              FROM events 
              GROUP BY event_name 
              ORDER BY event_name ), 
            UNNEST(attribute) AS dimension 
  ) DO 
    SET value_type_statement = CASE iterator.value 
                                 WHEN 'string' THEN 'STRING' 
                                 WHEN 'int' THEN 'INT64' 
                                 WHEN 'double' THEN 'FLOAT64' 
                                 WHEN 'float' THEN 'FLOAT64' 
                                 ELSE 'STRING' -- Default case if needed 
                               END; 

    -- Append to the flatten_statement for each parameter 
    SET flatten_statement = CONCAT(flatten_statement, 
                                   ', (SELECT value.', value_type_statement, '_value 
                                       FROM UNNEST(event_params) 
                                       WHERE key = ''', iterator.name, ''') AS ', iterator.name); 
  END FOR; 

  -- Debug: Output the constructed query (remove in production) 
  SELECT flatten_statement; 
END; 

 

sysph

Hi @ms4446 Thank you so much for replying. I got it fixed before i saw your message. I'll definitely test your code snippet. However, might not the quickest query. Anyway i could optimise it? I will modify to give default value in declaring variable. thanks again. have a nice weekend.

create or replace procedure `ga-xx.sandbox_xx`.proc_app (startDate STRING, endDate STRING, OUT sql STRING)
begin
declare value_type_statement string;
declare user_prop_statement string;
declare flatten_statement string; 


set flatten_statement = "WITH analytics AS ("
    || "SELECT "
    || "event_name"
    || ", event_date"
    || ", event_timestamp"
    || ", TIMESTAMP_MICROS(event_timestamp) as event_datetime"
    || ", event_previous_timestamp"
    || ", event_value_in_usd"
    || ", event_bundle_sequence_id"
    || ", event_server_timestamp_offset"
    || ", user_id"
    || ", user_pseudo_id" ;

for iterator in (with events as (
                  SELECT
                    DISTINCT param.key as name,
                    CASE
                      WHEN param.value.string_value is not null THEN ', (SELECT value.string_value ' --"string"
                      WHEN param.value.int_value is not null THEN ', (SELECT value.int_value ' --"int"
                      WHEN param.value.double_value is not null THEN ', (SELECT value.double_value ' --"double"
                      WHEN param.value.float_value is not null THEN  ', (SELECT value.float_value ' --"float"
                    END as value
                  FROM `ga-xx.analytics_xx.events_intraday_*`,
                  UNNEST(event_params) as param
                  WHERE _TABLE_SUFFIX between startDate and endDate
                  --DATE_sub(current_date(), interval 7 day) and DATE_sub(current_date(), interval 1 day)
                  GROUP BY name,value
                )
                select dimension.name, dimension.value from (
                SELECT
                  ARRAY_AGG(struct(name, value) order by name) as attribute
                FROM events
                GROUP BY name
                ), unnest(attribute) as dimension
                LIMIT 5)
do

                -- IF (iterator.value = "string") THEN set value_type_statement = ', (SELECT value.string_value '; end if;
                -- IF (iterator.value = "int") THEN set value_type_statement = ', (SELECT value.int_value '; end if; 
                -- IF (iterator.value = "double") THEN set value_type_statement = ', (SELECT value.double_value '; end if;
                -- IF (iterator.value = "float") THEN set value_type_statement = ', (SELECT value.float_value '; end if;

                set flatten_statement =  flatten_statement
                --value_type_statement 
                  || iterator.value
                  || " FROM UNNEST(event_params) WHERE key = '" || iterator.name || "') AS " || iterator.name;
                 -- , (SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'KGVideoElapsedSeconds') AS KGVideoElapsedSeconds

end for;

for iterator_userprop in (with user_props as (
                    select
                      DISTINCT user_prop.key as user_prop_key,
                      case when user_prop.value.string_value is not null then ', (SELECT value.string_value '
                          when user_prop.value.int_value is not null then ', (SELECT value.int_value ' --"int"
                          when user_prop.value.double_value is not null then ', (SELECT value.double_value ' --"double"
                          when user_prop.value.float_value is not null then ', (SELECT value.float_value ' --"float"
                          else null end as user_prop_value
                    from
                        `ga-xx.analytics_xx.events_intraday_*`,
                        unnest(user_properties) as user_prop
                    group by
                        user_prop_key,
                        user_prop_value
                    order by
                        user_prop_key
                    )
                    select dimension.user_prop_key, dimension.user_prop_value from (
                    SELECT
                      ARRAY_AGG(struct(user_prop_key, user_prop_value) order by user_prop_key) as up_attribute
                    FROM user_props
                    GROUP BY user_prop_key
                    ORDER BY user_prop_key
                    ), unnest(up_attribute) as dimension)  
do
  set flatten_statement = flatten_statement 
    || iterator_userprop.user_prop_value
    || " FROM UNNEST(user_properties) WHERE key = '" || iterator_userprop.user_prop_key || "') AS " || iterator_userprop.user_prop_key;
end for;  

set flatten_statement = flatten_statement
||    " , privacy_info.analytics_storage AS analytics_storage"
||    " , privacy_info.ads_storage AS ads_storage"
||    " , privacy_info.uses_transient_token AS uses_transient_token"
||    " , user_first_touch_timestamp "
||    " , user_ltv.revenue AS user_ltv_revenue"
||    " , user_ltv.currency AS user_ltv_currency"
||    " FROM `ga-xx.analytics_xx.events_intraday_*` "
||    " WHERE _TABLE_SUFFIX BETWEEN '"||startDate||"' AND '"||endDate||"' "
||    " ) "
||    "  SELECT * FROM analytics;";

--EXECUTE IMMEDIATE (flatten_statement);
    set sql = flatten_statement;
end;


--DECLARE sql STRING;
--CALL `ga-xx.sandbox_xx.proc_xx_app`('20240315','20240315',sql);
--SELECT sql;

ms4446

Here are several strategies to enhance performance and maintainability:

1. Minimize Dynamic SQL Construction Complexity

Dynamic SQL creation can be computationally expensive. Simplify the logic to reduce the complexity and number of iterations within your procedure.

2. Use Pre-aggregated Metadata

Create a table storing metadata about event parameters and user properties (data types, extraction logic).
This pre-computed data makes dynamic SQL generation simpler and faster.

3. Optimize Iteration Logic

Group similar operations to minimize iterations.
Handle cases with shared SQL generation logic together.

4. Leverage BigQuery Functions

BigQuery's built-in functions often simplify complex logic.
Consider using functions flexible with multiple data types rather than dynamic type-based cases.

5. Streamline SQL String Concatenation

Use efficient string operations or minimize the need for concatenation entirely.

Without altering the core logic of your procedure significantly, here are a few adjustments for potential optimizations:

 

-- Example of a simplified approach (conceptual, adjust as needed): 
CREATE OR REPLACE PROCEDURE `ga-xx.sandbox_xx`.proc_app(startDate STRING, endDate STRING, OUT sql STRING) 
BEGIN 
  -- Base SQL statement 
  DECLARE base_sql STRING DEFAULT "WITH analytics AS (SELECT event_name, ... FROM `ga-xx.analytics_xx.events_intraday_*` WHERE _TABLE_SUFFIX BETWEEN '" || startDate || "' AND '" || endDate || "') SELECT * FROM analytics;"; 

 -- Pre-aggregated metadata tables (param_meta, user_prop_meta)
 DECLARE params_sql ARRAY<STRING>; 
 DECLARE user_props_sql ARRAY<STRING>; 

 -- Construct SQL fragments (optimized)
 FOR param IN (SELECT extract_query FROM `ga-xx.sandbox_xx.param_meta`) DO 
   SET params_sql = ARRAY_CONCAT(params_sql, [param.extract_query]); 
 END FOR; 

 -- ... (similarly for user_props_sql)

 -- Combine into final SQL (more efficient concatenation)
 SET sql = CONCAT(base_sql, STRING_AGG(params_sql, ', '), STRING_AGG(user_props_sql, ', ')); 
END; 

 

sysph

Thanks @ms4446 it's so helpful!

I have a tiny question here. I pass in the project id, dataset name, table name, and date range to the stored procedure. There's a error :invalid value: Table "full_table_name" must be qualified with a dataset (e.g. dataset.table) something wrong with the full_table_name variable which i couldn't make it work.

create or replace procedure `ga-xx.sandbox_xx`.proc_xapp (project_id STRING, dataset_name STRING, table_name STRING, startDate STRING, endDate STRING, OUT sql STRING)
begin
-- declare row_count INT64;
declare full_table_name string;
declare value_type_statement string;
declare user_prop_statement string;
declare flatten_statement string DEFAULT "WITH analytics AS ("
                                            || "SELECT "
                                            || "event_name  \n"
                                            || ", event_date \n"
                                            || ", event_timestamp \n"
                                            || ", TIMESTAMP_MICROS(event_timestamp) as event_datetime \n"
                                            || ", event_previous_timestamp \n"
                                            || ", event_value_in_usd \n"
                                            || ", event_bundle_sequence_id \n"
                                            || ", event_server_timestamp_offset \n"
                                            || ", user_id \n"
                                            || ", user_pseudo_id \n";

set full_table_name = '`'||project_id||'.'||dataset_name||'.'||table_name||'`';
for iterator in (with events as (
                  SELECT
                    DISTINCT param.key as name,
                    CASE
                      WHEN param.value.string_value is not null THEN ', (SELECT value.string_value ' --"string"
                      WHEN param.value.int_value is not null THEN ', (SELECT value.int_value ' --"int"
                      WHEN param.value.double_value is not null THEN ', (SELECT value.double_value ' --"double"
                      WHEN param.value.float_value is not null THEN  ', (SELECT value.float_value ' --"float"
                    END as value
                  x
                  FROM full_table_name
                  WHERE _TABLE_SUFFIX between startDate and endDate
                  GROUP BY name,value
                )
                select dimension.name, dimension.value from (
                SELECT
                  ARRAY_AGG(struct(name, value) order by name) as attribute
                FROM events
                GROUP BY name
                ), unnest(attribute) as dimension
                LIMIT 5) 
do

                -- IF (iterator.value = "string") THEN set value_type_statement = ', (SELECT value.string_value '; end if;
                -- IF (iterator.value = "int") THEN set value_type_statement = ', (SELECT value.int_value '; end if; 
                -- IF (iterator.value = "double") THEN set value_type_statement = ', (SELECT value.double_value '; end if;
                -- IF (iterator.value = "float") THEN set value_type_statement = ', (SELECT value.float_value '; end if;

                set flatten_statement = flatten_statement
                  || iterator.value
                  || " FROM UNNEST(event_params) WHERE key = '" || iterator.name || "') AS " || iterator.name||" \n";

end for;
for iterator_userprop in (with user_props as (
                    select
                      DISTINCT user_prop.key as user_prop_key,
                      case when user_prop.value.string_value is not null then ', (SELECT value.string_value ' --"string"
                          when user_prop.value.int_value is not null then ', (SELECT value.int_value ' --"int"
                          when user_prop.value.double_value is not null then ', (SELECT value.double_value ' --"double"
                          when user_prop.value.float_value is not null then ', (SELECT value.float_value ' --"float"
                          else null 
                      end as user_prop_value
                    from
                        `ga-xx.analytics_xx.events_intraday_*`,unnest(user_properties) as user_prop
                    WHERE _TABLE_SUFFIX between startDate and endDate
                    group by
                        user_prop_key,
                        user_prop_value
                    order by
                        user_prop_key
                    )
                    select dimension.user_prop_key, dimension.user_prop_value from (
                    SELECT
                      ARRAY_AGG(struct(user_prop_key, user_prop_value) order by user_prop_key) as up_attribute
                    FROM user_props
                    GROUP BY user_prop_key
                    ORDER BY user_prop_key
                    ), unnest(up_attribute) as dimension)  
do
              set flatten_statement = flatten_statement 
                || iterator_userprop.user_prop_value
                || " FROM UNNEST(user_properties) WHERE key = '" || iterator_userprop.user_prop_key || "') AS " || iterator_userprop.user_prop_key||"\n";
end for;  


 set flatten_statement = flatten_statement

||    " FROM `ga-xx.analytics_xx.events_intraday_*` "
||    " WHERE _TABLE_SUFFIX BETWEEN '"||startDate||"' AND '"||endDate||"' "
||    " ) "
||    "  SELECT * FROM analytics;";

--EXECUTE IMMEDIATE (flatten_statement);
    set sql = flatten_statement;
end;

DECLARE sql STRING;
CALL `ga-xx.sandbox_yy.proc_app`('ga-xx','analytics_zz','events_intraday_*','20240319','20240319',sql);
SELECT sql;   
EXECUTE IMMEDIATE sql;

Once it's pass the test, i'll start optimise the code. Thanks for helping

ms4446

Hi @sysph ,

The error you're encountering occurs because BigQuery doesn't support directly inserting variables like full_table_name into the FROM clause of a standard SQL query. This is a security and optimization feature.

To work around this, you need to:

Construct your SQL query as a string. Include placeholders for the dynamic parts (like the table name).
Execute the constructed string using EXECUTE IMMEDIATE.

 

CREATE OR REPLACE PROCEDURE `ga-xx.sandbox_xx`.proc_app ( 
    project_id STRING, 
    dataset_name STRING, 
    table_name STRING, 
    startDate STRING, 
    endDate STRING, 
    OUT sql STRING 
) 
BEGIN 
    -- Declare variables 
    DECLARE full_table_name STRING; 
    DECLARE query_statement STRING; 

    -- Construct the full table name 
    SET full_table_name = CONCAT('`', project_id, '.', dataset_name, '.', table_name, '`'); 

    -- Build the dynamic SQL statement
    SET query_statement = CONCAT(
        "WITH analytics AS (",
            "SELECT event_name, event_date, event_timestamp, ", 
            "       TIMESTAMP_MICROS(event_timestamp) as event_datetime, ",
            "       event_previous_timestamp, event_value_in_usd, event_bundle_sequence_id, ",
            "       event_server_timestamp_offset, user_id, user_pseudo_id ",
            "FROM ", full_table_name, " ",
            "WHERE _TABLE_SUFFIX BETWEEN '", startDate, "' AND '", endDate, "'" 
        ")" 
    ); 

    -- Add any further dynamic parts to the query as needed

    -- Finalize the SQL statement 
    SET query_statement = CONCAT(query_statement, " SELECT * FROM analytics;"); 

    -- Return the constructed SQL statement 
    SET sql = query_statement; 
END; 

How to Execute:

 

DECLARE sql STRING; 
CALL `ga-xx.sandbox_xx`.proc_app(
     'ga-xx', 
     'analytics_zz', 
     'events_intraday_*', 
     '20240319', 
     '20240319', 
     sql
); 

EXECUTE IMMEDIATE sql; 

 

sysph

Apologise here. The code snippet contains repeating query.

I got another question. I use the code snippet below to fetch the sub-column (the other nested and un-nested columns). (Some of the fields name is not existing/matching in the GA4 table (for app). Why is it? What would be the best way to retrieve those fields? Thanks

FOR iterator_nonnested in (WITH table_columns AS (
                      SELECT
                          DISTINCT column_name, field_path, data_type 
                        FROM
                          `ga-xx`.analytics_yy.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
                        -- WHERE
                        --   table_name = 'events_'||startDate
                      )
                      SELECT *
                      FROM table_columns 
                      WHERE column_name NOT IN ('event_name', 'event_date', 'event_timestamp','event_previous_timestamp', 'event_value_in_usd','event_bundle_sequence_id','event_server_timestamp_offset','user_id','user_pseudo_id','pseudo_user_id','user_info','event_params','user_properties','items','audiences','predictions','occurrence_date','last_updated_date')
                      AND field_path NOT IN ('device.unified_screen_name','user_ltv.revenue_in_usd','user_ltv.sessions','user_ltv.engagement_time_millis','user_ltv.purchases','user_ltv.engaged_sessions','user_ltv.session_duration_micros','privacy_info.is_limited_ad_tracking','privacy_info.is_ads_personalization_allowed'))
DO
 -- if column_name != field_path then directly add to flatten_satement 
              IF (iterator_nonnested.column_name != iterator_nonnested.field_path) THEN 
                set flatten_statement = flatten_statement 
                  || ", " || iterator_nonnested.field_path || " AS " || REPLACE( iterator_nonnested.field_path, '.', '_') || " \n"; 
              END IF;

              -- if column_name = field_path and count =1 then add it to flatten_statement eg. is_active_user only 1
              IF (iterator_nonnested.column_name = iterator_nonnested.field_path) THEN 
                  for iterator_column_count in (WITH table_column_count AS (
                      SELECT
                          DISTINCT column_name, field_path, data_type 
                        FROM
                          `ga-xx`.analytics_yy.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
                        -- WHERE
                        --   table_name = 'events_'||startDate
                      )
                      SELECT count(*) AS column_name_count,column_name
                      FROM table_column_count 
                      WHERE column_name NOT IN ('event_name', 'event_date', 'event_timestamp','event_previous_timestamp', 'event_value_in_usd','event_bundle_sequence_id','event_server_timestamp_offset','user_id','user_pseudo_id','pseudo_user_id','user_info','event_params','user_properties','items','audiences','predictions','occurrence_date','last_updated_date')
                      AND field_path NOT IN ('device.unified_screen_name','user_ltv.revenue_in_usd','user_ltv.sessions','user_ltv.engagement_time_millis','user_ltv.purchases','user_ltv.engaged_sessions','user_ltv.session_duration_micros','privacy_info.is_limited_ad_tracking','privacy_info.is_ads_personalization_allowed')
                      group by column_name
                      having column_name = iterator_nonnested.column_name)
                  do
                      IF (iterator_column_count.column_name_count= 1) THEN 
                        set flatten_statement = flatten_statement || ", " || iterator_nonnested.field_path || " \n";
                      END IF;    
                  end for;
              END IF;
END FOR;

ms4446

Hi @sysph ,

You've encountered a common challenge: not all fields you might expect from the GA4 schema are present in every dataset. This discrepancy can occur due to several reasons:

Custom Events and Parameters: GA4 allows for custom events and parameters, which means not all datasets will have the same schema. Some fields might be present in one dataset but absent in another due to the custom nature of event tracking set up by different users.
Schema Evolution: The schema for GA4 data in BigQuery can evolve over time as new features are added to GA4 and as Google modifies the data export format. This can lead to situations where newer datasets have fields that older datasets do not.
Export Configurations: The configuration of the GA4 property and its data export to BigQuery might affect which fields are available. Some fields might be omitted based on user settings or the specific events being tracked.

To handle these discrepancies and efficiently retrieve available fields, consider the following strategies:

Dynamic Schema Discovery: Your current approach of using INFORMATION_SCHEMA.COLUMN_FIELD_PATHS for dynamic schema discovery is on the right track. This allows you to adapt to the available schema dynamically.
Graceful Handling of Missing Fields: When constructing your query, it might be beneficial to check if a field exists in the dataset before attempting to include it in your query. This can be challenging to do directly in SQL but consider preprocessing steps that might filter out non-existent fields based on your dynamic schema discovery.
Fallback Mechanisms: For critical fields that might be missing, consider implementing fallback mechanisms where possible. For example, if a specific user property is not available, you might be able to infer or approximate its value from other available data.
Schema Versioning: Keep track of schema changes over time. If you're working with historical data, understanding when certain fields were introduced or deprecated can help you adjust your queries accordingly.

sysph

Here's my final stored procedure to generate flatten query and it's faster than the previous one that uses lots for... loops. Please let me know if any better way to improve this. Thanks in advance

CREATE OR REPLACE PROCEDURE `ga-xx.dataset`.proc_app_optimised (project_id STRING, dataset_name STRING, table_name STRING, startDate STRING, endDate STRING, OUT sql STRING) 
BEGIN
    DECLARE table_name_argument STRING;
    DECLARE full_table_name string;
    -- Base SQL statement
    DECLARE flatten_statement string DEFAULT "WITH analytics AS ("
                                                || "SELECT "
                                                || "event_name  \n"
                                                || ", event_date \n"
                                                || ", event_timestamp \n"
                                                || ", TIMESTAMP_MICROS(event_timestamp) as event_datetime \n"
                                                || ", event_previous_timestamp \n"
                                                || ", event_value_in_usd \n"
                                                || ", event_bundle_sequence_id \n"
                                                || ", event_server_timestamp_offset \n"
                                                || ", user_id \n"
                                                || ", user_pseudo_id \n";


-- Pre-aggregated metadata tables (param_meta, user_prop_meta)
    DECLARE event_params_sql string;
    DECLARE user_props_sql string;
    DECLARE other_unnested_columns_sql string;

    --DECLARE params_sql ARRAY<STRING>;
    -- DECLARE user_props_sql ARRAY<STRING>;
    
-- Construct the full table name 
    SET full_table_name = CONCAT(project_id, '.', dataset_name, '.', table_name); 

-- Construct all event paramters 
    EXECUTE IMMEDIATE format(r"""
                        with events as (
                            SELECT
                            DISTINCT param.key as name,
                            CASE
                                WHEN param.value.string_value is not null THEN ' (SELECT value.string_value '
                                WHEN param.value.int_value is not null THEN ' (SELECT value.int_value '
                                WHEN param.value.double_value is not null THEN ' (SELECT value.double_value '
                                WHEN param.value.float_value is not null THEN  ' (SELECT value.float_value '
                            END as value
                            FROM `%s`,UNNEST(event_params) as param
                            WHERE _TABLE_SUFFIX between '%s' and '%s'
                            GROUP BY name,value)

                                 SELECT STRING_AGG(value|| "FROM UNNEST(event_params) WHERE key = '" || name || "') AS " || name || '\n' ORDER BY name) 
                                FROM events
                                """
                        ,full_table_name,startDate, endDate) INTO event_params_sql;


                       set flatten_statement = flatten_statement || ',' || event_params_sql;

--Construct all user properties
    EXECUTE IMMEDIATE format(r"""
                            with user_props as (
                                select
                                DISTINCT user_prop.key as user_prop_key,
                                case when user_prop.value.string_value is not null then ' (SELECT value.string_value ' 
                                    when user_prop.value.int_value is not null then ' (SELECT value.int_value '
                                    when user_prop.value.double_value is not null then ' (SELECT value.double_value ' 
                                    when user_prop.value.float_value is not null then ' (SELECT value.float_value '
                                    else null 
                                end as user_prop_value
                                from `%s`,unnest(user_properties) as user_prop
                                WHERE _TABLE_SUFFIX between '%s' and '%s'
                                group by
                                    user_prop_key,
                                    user_prop_value
                                order by
                                    user_prop_key
                                )

                                SELECT STRING_AGG(user_prop_value|| "FROM UNNEST(user_properties) WHERE key = '" || user_prop_key || "') AS " || user_prop_key || '\n' ORDER BY user_prop_key) 
                                FROM user_props
                               
                            """
                            ,full_table_name,startDate, endDate) INTO user_props_sql;

   set flatten_statement = flatten_statement || ',' || user_props_sql;

IF (table_name = 'events_intraday_*') THEN
    set table_name_argument = 'events_intraday';
ELSE 
    set table_name_argument = 'events';
END IF;

--Construct all other column
 EXECUTE IMMEDIATE format(r"""
                            WITH table_columns AS (
                                SELECT DISTINCT column_name, field_path, data_type 
                                FROM `%s.%s.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
                                WHERE table_catalog = '%s'
                                AND table_schema = '%s'
                                AND table_name BETWEEN '%s_%s' and '%s_%s'
                                AND data_type IN ('STRING','INT64','FLOAT64','NUMERIC','TIMESTAMP','DATE','TIME','DATETIME','BOOL','BIGNUMERIC','ARRAY')
                                AND column_name NOT IN ('event_name', 'event_date', 'event_timestamp','event_previous_timestamp', 'event_value_in_usd','event_bundle_sequence_id','event_server_timestamp_offset','user_id','user_pseudo_id','pseudo_user_id','event_params','user_properties','items'))
                            SELECT STRING_AGG(field_path || " AS " || REPLACE(field_path, '.', '_') || ' \n' ORDER BY field_path) FROM table_columns
                          """
                          ,project_id,dataset_name,project_id,dataset_name,table_name_argument,startDate,table_name_argument,endDate) INTO other_unnested_columns_sql;
    set flatten_statement = flatten_statement || ',' || other_unnested_columns_sql;


 set flatten_statement = flatten_statement 
||    " FROM `"||full_table_name||"` \n"
||    " WHERE _TABLE_SUFFIX BETWEEN '"||startDate||"' AND '"||endDate||"' \n"
||    " ) \n"
||    "  SELECT * FROM analytics;";

set sql = flatten_statement;
    
end;


-- DECLARE sql STRING;
-- CALL `ga-xx.sandbox_xx.proc_app_optimise`('ga-xx','analytics_xx','events_*','20240327','20240328',sql);
-- SELECT sql;    
-- EXECUTE IMMEDIATE (sql);

ms4446

Great job on updating your stored procedure. The improvements you've made, particularly moving away from a loop-heavy approach, should enhance both performance and maintainability.

Some Observations

Your proficient use of EXECUTE IMMEDIATE in conjunction with FORMAT() effectively streamlines the construction of subqueries. This strategy adeptly handles variations in event parameters, user properties, and other columns, playing a pivotal role in the observed performance enhancements.
Your strategic limitation of included columns based on their data types showcases a deep understanding of performance optimization. By customizing your query's output, you minimize processing overhead and facilitate downstream data handling.

Here are some suggestions for enhancement:

Parameter Validation: Strengthening the stored procedure's reliability through input validation for parameters such as project_id, dataset_name, table_name, startDate, and endDate is essential. Implementing checks for non-null values, verifying date formats, and ensuring logical date sequencing can mitigate common issues.
Error Handling: Adopting BigQuery's BEGIN...EXCEPTION...END structure allows for more sophisticated error management within your dynamic SQL constructs. This addition is vital for easing debugging and maintenance, particularly in production environments.

Exploring ARRAY_AGG: Although STRING_AGG currently fulfills your requirements, ARRAY_AGG might provide benefits by retaining data type information within your query's output. This could be particularly advantageous for analyses dependent on preserving original data types.
Strategic Use of Temporary Tables: If your final query often references the same intermediate results, employing temporary tables to materialize these results could enhance performance. Benchmarking with realistic datasets will inform this decision.

Enhancing ARRAY_AGG Usage

The example you provided for utilizing ARRAY_AGG to preserve type information within the flattened output is insightful. It underscores how maintaining data types can enrich subsequent analyses.

Use case...when...then...end statement in stored procedure in BigQuery