Solved: Use case...when...then...end statement in stored p... - Page 2

sysph · 03-15-2024 07:35 AM

Hi there,

hope you can help me with this. I'm trying to use stored procedure to create flatten query and data.

I have this query to fetch all events and event parameters

with events as (
  SELECT
    event_name,
    param.key as name,
    CASE
      WHEN param.value.string_value is not null THEN "string"
      WHEN param.value.int_value is not null THEN "int"
      WHEN param.value.double_value is not null THEN "double"
      WHEN param.value.float_value is not null THEN "float"
    END as value
   FROM `xx.analytics_xxxx.events_*`,
  UNNEST(event_params) as param
  WHERE _TABLE_SUFFIX between '20240315' and '20240315'
  --DATE_sub(current_date(), interval 7 day) and DATE_sub(current_date(), interval 1 day)
  GROUP BY event_name, name, value
)
 
 
select event_name, dimension.name, dimension.value from (
SELECT
  event_name,
  ARRAY_AGG(struct(name, value) order by name) as attribute
FROM events
GROUP BY event_name
ORDER BY event_name
), unnest(attribute) as dimension

And i also have a flatten query

WITH analytics AS (
SELECT
    event_name
    , event_date
    , event_timestamp
    , TIMESTAMP_MICROS(event_timestamp) as event_datetime
    , event_previous_timestamp
    , event_value_in_usd
    , event_bundle_sequence_id
    , event_server_timestamp_offset
    , user_id
    , user_pseudo_id
    ------------------------------- Event Parameters ------------------------------
    , (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'Appearance') AS Appearance
    , (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'CollectionSet') AS CollectionSet
    , (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'Connectivity') AS Connectivity
...
FROM `ga-xxx.analytics_xxx.events_*`
    --, UNNEST(items) AS items
    WHERE _TABLE_SUFFIX BETWEEN '20240315' AND '20240315' 

)
, analytics_updated AS (
    SELECT
    *
    --CONCAT(ga_session_id,".",user_pseudo_id) AS unique_session_id
    --, *
    FROM analytics
)
SELECT * FROM analytics_updated

I'm trying to combine these 2 together. I use for...end for; to loop through the parameters in order to auto generate the select statement in the query. Here's what i have the moment, however, i'm stucked at the case ...when...then...end statement. How do i join the value into the query string (set value_type_statement = ...) into flatten_statement var? Thanks in advance.

create or replace procedure `ga-xx.sandbox_xx`.proc_app ()
begin
declare value_type_statement string;
declare parameters_statement string;
declare flatten_statement string; 


set flatten_statement = "select "
    || "event_name"
    || ", event_date"
    || ", event_timestamp"
    || ", TIMESTAMP_MICROS(event_timestamp) as event_datetime"
    || ", event_previous_timestamp"
    || ", event_value_in_usd"
    || ", event_bundle_sequence_id"
    || ", event_server_timestamp_offset"
    || ", user_id"
    || ", user_pseudo_id" ;

for iterator in (with events as (
  SELECT
    event_name,
    param.key as name,
    CASE
      WHEN param.value.string_value is not null THEN "string"
      WHEN param.value.int_value is not null THEN "int"
      WHEN param.value.double_value is not null THEN "double"
      WHEN param.value.float_value is not null THEN "float"
    END as value
   FROM `ga-xx.analytics_xxx.events_*`,
  UNNEST(event_params) as param
  WHERE _TABLE_SUFFIX between '20240315' and '20240315'
  --DATE_sub(current_date(), interval 7 day) and DATE_sub(current_date(), interval 1 day)
  GROUP BY event_name, name, value
)
select event_name, dimension.name, dimension.value from (
SELECT
  event_name,
  ARRAY_AGG(struct(name, value) order by name) as attribute
FROM events
GROUP BY event_name
ORDER BY event_name
), unnest(attribute) as dimension)
do

  -- IF( iterator.value== "string" , 'true', 'false') AS result
    -- CASE
    --   WHEN "string" THEN "(SELECT value.string_value "
    --   WHEN "int" THEN "(SELECT value.int_value "
    --   WHEN "double" THEN "(SELECT value.double_value "
    --   WHEN "float" THEN "(SELECT value.float_value "
    -- END as value
  IF (iterator.value = "string") THEN set value_type_statement = ' (SELECT value.string_value '; end if;
  IF (iterator.value = "int") THEN set value_type_statement = ' (SELECT value.int_value '; end if; 
  IF (iterator.value = "double") THEN set value_type_statement = ' (SELECT value.double_value '; end if;
  IF (iterator.value = "float") THEN set value_type_statement = ' (SELECT value.float_value '; end if;

  set flatten_statement = flatten_statement 
    || value_type_statement 
    || " FROM UNNEST(event_params) WHERE key = " || iterator.name || "AS" || iterator.name;

end for;

set flatten_statement = flatten_statement;
end;

--call `ga-xx.sandbox_xx`.proc_app ();

ms4446

To build a query string dynamically within a BigQuery stored procedure, especially when you need to handle different data types with conditional logic, you will have to piece together your SQL string by checking each condition and appending the appropriate part of the SQL based on the outcome. While you were thinking of using a CASE statement within a loop to handle these conditions, BigQuery's scripting doesn't support directly assigning a variable's value with a CASE statement outside a SELECT query. A more straightforward approach is to use IF statements or to integrate the CASE statement within the construction of your query string itself.

Here's a revised approach to build your flatten_statement string within the loop, using IF statements for clarity and simplicity:

CREATE OR REPLACE PROCEDURE `ga-xxx.sandbox_x`.proc_app() 
BEGIN 
  DECLARE value_type_statement STRING DEFAULT ''; 
  DECLARE flatten_statement STRING DEFAULT 'SELECT 
                                               event_name, 
                                               event_date, 
                                               event_timestamp, 
                                               TIMESTAMP_MICROS(event_timestamp) AS event_datetime, 
                                               event_previous_timestamp, 
                                               event_value_in_usd, 
                                               event_bundle_sequence_id, 
                                               event_server_timestamp_offset, 
                                               user_id, 
                                               user_pseudo_id'; 

  -- Assuming you want to append to the flatten_statement for each parameter 
  FOR iterator IN ( 
       WITH events AS ( 
            SELECT 
               event_name, 
               param.key AS name, 
               CASE 
                 WHEN param.value.string_value IS NOT NULL THEN 'string' 
                 WHEN param.value.int_value IS NOT NULL THEN 'int' 
                 WHEN param.value.double_value IS NOT NULL THEN 'double' 
                 WHEN param.value.float_value IS NOT NULL THEN 'float' 
                 END AS value 
            FROM `ga-xx.analytics_xxx.events_*`, 
                 UNNEST(event_params) AS param 
            WHERE _TABLE_SUFFIX BETWEEN '20240315' AND '20240315' 
            GROUP BY event_name, name, value 
       ) 
       SELECT event_name, dimension.name, dimension.value 
       FROM ( SELECT event_name, ARRAY_AGG(STRUCT(name, value) ORDER BY name) AS attribute 
              FROM events 
              GROUP BY event_name 
              ORDER BY event_name ), 
            UNNEST(attribute) AS dimension 
  ) DO 
    SET value_type_statement = CASE iterator.value 
                                 WHEN 'string' THEN 'STRING' 
                                 WHEN 'int' THEN 'INT64' 
                                 WHEN 'double' THEN 'FLOAT64' 
                                 WHEN 'float' THEN 'FLOAT64' 
                                 ELSE 'STRING' -- Default case if needed 
                               END; 

    -- Append to the flatten_statement for each parameter 
    SET flatten_statement = CONCAT(flatten_statement, 
                                   ', (SELECT value.', value_type_statement, '_value 
                                       FROM UNNEST(event_params) 
                                       WHERE key = ''', iterator.name, ''') AS ', iterator.name); 
  END FOR; 

  -- Debug: Output the constructed query (remove in production) 
  SELECT flatten_statement; 
END; 

 

View solution in original post

ms4446

Here are several strategies to enhance performance and maintainability:

1. Minimize Dynamic SQL Construction Complexity

Dynamic SQL creation can be computationally expensive. Simplify the logic to reduce the complexity and number of iterations within your procedure.

2. Use Pre-aggregated Metadata

Create a table storing metadata about event parameters and user properties (data types, extraction logic).
This pre-computed data makes dynamic SQL generation simpler and faster.

3. Optimize Iteration Logic

Group similar operations to minimize iterations.
Handle cases with shared SQL generation logic together.

4. Leverage BigQuery Functions

BigQuery's built-in functions often simplify complex logic.
Consider using functions flexible with multiple data types rather than dynamic type-based cases.

5. Streamline SQL String Concatenation

Use efficient string operations or minimize the need for concatenation entirely.

Without altering the core logic of your procedure significantly, here are a few adjustments for potential optimizations:

 

-- Example of a simplified approach (conceptual, adjust as needed): 
CREATE OR REPLACE PROCEDURE `ga-xx.sandbox_xx`.proc_app(startDate STRING, endDate STRING, OUT sql STRING) 
BEGIN 
  -- Base SQL statement 
  DECLARE base_sql STRING DEFAULT "WITH analytics AS (SELECT event_name, ... FROM `ga-xx.analytics_xx.events_intraday_*` WHERE _TABLE_SUFFIX BETWEEN '" || startDate || "' AND '" || endDate || "') SELECT * FROM analytics;"; 

 -- Pre-aggregated metadata tables (param_meta, user_prop_meta)
 DECLARE params_sql ARRAY<STRING>; 
 DECLARE user_props_sql ARRAY<STRING>; 

 -- Construct SQL fragments (optimized)
 FOR param IN (SELECT extract_query FROM `ga-xx.sandbox_xx.param_meta`) DO 
   SET params_sql = ARRAY_CONCAT(params_sql, [param.extract_query]); 
 END FOR; 

 -- ... (similarly for user_props_sql)

 -- Combine into final SQL (more efficient concatenation)
 SET sql = CONCAT(base_sql, STRING_AGG(params_sql, ', '), STRING_AGG(user_props_sql, ', ')); 
END; 

 

View solution in original post

ms4446

Hi @sysph ,

The error you're encountering occurs because BigQuery doesn't support directly inserting variables like full_table_name into the FROM clause of a standard SQL query. This is a security and optimization feature.

To work around this, you need to:

Construct your SQL query as a string. Include placeholders for the dynamic parts (like the table name).
Execute the constructed string using EXECUTE IMMEDIATE.

 

CREATE OR REPLACE PROCEDURE `ga-xx.sandbox_xx`.proc_app ( 
    project_id STRING, 
    dataset_name STRING, 
    table_name STRING, 
    startDate STRING, 
    endDate STRING, 
    OUT sql STRING 
) 
BEGIN 
    -- Declare variables 
    DECLARE full_table_name STRING; 
    DECLARE query_statement STRING; 

    -- Construct the full table name 
    SET full_table_name = CONCAT('`', project_id, '.', dataset_name, '.', table_name, '`'); 

    -- Build the dynamic SQL statement
    SET query_statement = CONCAT(
        "WITH analytics AS (",
            "SELECT event_name, event_date, event_timestamp, ", 
            "       TIMESTAMP_MICROS(event_timestamp) as event_datetime, ",
            "       event_previous_timestamp, event_value_in_usd, event_bundle_sequence_id, ",
            "       event_server_timestamp_offset, user_id, user_pseudo_id ",
            "FROM ", full_table_name, " ",
            "WHERE _TABLE_SUFFIX BETWEEN '", startDate, "' AND '", endDate, "'" 
        ")" 
    ); 

    -- Add any further dynamic parts to the query as needed

    -- Finalize the SQL statement 
    SET query_statement = CONCAT(query_statement, " SELECT * FROM analytics;"); 

    -- Return the constructed SQL statement 
    SET sql = query_statement; 
END; 

How to Execute:

 

DECLARE sql STRING; 
CALL `ga-xx.sandbox_xx`.proc_app(
     'ga-xx', 
     'analytics_zz', 
     'events_intraday_*', 
     '20240319', 
     '20240319', 
     sql
); 

EXECUTE IMMEDIATE sql; 

 

View solution in original post

ms4446

Hi @sysph ,

You've encountered a common challenge: not all fields you might expect from the GA4 schema are present in every dataset. This discrepancy can occur due to several reasons:

Custom Events and Parameters: GA4 allows for custom events and parameters, which means not all datasets will have the same schema. Some fields might be present in one dataset but absent in another due to the custom nature of event tracking set up by different users.
Schema Evolution: The schema for GA4 data in BigQuery can evolve over time as new features are added to GA4 and as Google modifies the data export format. This can lead to situations where newer datasets have fields that older datasets do not.
Export Configurations: The configuration of the GA4 property and its data export to BigQuery might affect which fields are available. Some fields might be omitted based on user settings or the specific events being tracked.

To handle these discrepancies and efficiently retrieve available fields, consider the following strategies:

Dynamic Schema Discovery: Your current approach of using INFORMATION_SCHEMA.COLUMN_FIELD_PATHS for dynamic schema discovery is on the right track. This allows you to adapt to the available schema dynamically.
Graceful Handling of Missing Fields: When constructing your query, it might be beneficial to check if a field exists in the dataset before attempting to include it in your query. This can be challenging to do directly in SQL but consider preprocessing steps that might filter out non-existent fields based on your dynamic schema discovery.
Fallback Mechanisms: For critical fields that might be missing, consider implementing fallback mechanisms where possible. For example, if a specific user property is not available, you might be able to infer or approximate its value from other available data.
Schema Versioning: Keep track of schema changes over time. If you're working with historical data, understanding when certain fields were introduced or deprecated can help you adjust your queries accordingly.

View solution in original post

Use case...when...then...end statement in stored procedure in BigQuery