Hi there,
I hope you can help me with this. I'm trying to use a stored procedure to generate a flattened query and data.
I have this query to fetch all events and event parameters:
with events as (
SELECT
event_name,
param.key as name,
CASE
WHEN param.value.string_value is not null THEN "string"
WHEN param.value.int_value is not null THEN "int"
WHEN param.value.double_value is not null THEN "double"
WHEN param.value.float_value is not null THEN "float"
END as value
FROM `xx.analytics_xxxx.events_*`,
UNNEST(event_params) as param
WHERE _TABLE_SUFFIX between '20240315' and '20240315'
--DATE_sub(current_date(), interval 7 day) and DATE_sub(current_date(), interval 1 day)
GROUP BY event_name, name, value
)
select event_name, dimension.name, dimension.value from (
SELECT
event_name,
ARRAY_AGG(struct(name, value) order by name) as attribute
FROM events
GROUP BY event_name
ORDER BY event_name
), unnest(attribute) as dimension
And I also have a flattening query:
WITH analytics AS (
SELECT
event_name
, event_date
, event_timestamp
, TIMESTAMP_MICROS(event_timestamp) as event_datetime
, event_previous_timestamp
, event_value_in_usd
, event_bundle_sequence_id
, event_server_timestamp_offset
, user_id
, user_pseudo_id
------------------------------- Event Parameters ------------------------------
, (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'Appearance') AS Appearance
, (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'CollectionSet') AS CollectionSet
, (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'Connectivity') AS Connectivity
...
FROM `ga-xxx.analytics_xxx.events_*`
--, UNNEST(items) AS items
WHERE _TABLE_SUFFIX BETWEEN '20240315' AND '20240315'
)
, analytics_updated AS (
SELECT
*
--CONCAT(ga_session_id,".",user_pseudo_id) AS unique_session_id
--, *
FROM analytics
)
SELECT * FROM analytics_updated
I'm trying to combine these two together. I use FOR...END FOR to loop through the parameters in order to auto-generate the SELECT statement in the query. Here's what I have at the moment; however, I'm stuck at the CASE...WHEN...THEN...END statement. How do I join the value into the query string (SET value_type_statement = ...) and then into the flatten_statement variable? Thanks in advance.
create or replace procedure `ga-xx.sandbox_xx`.proc_app ()
begin
declare value_type_statement string;
declare parameters_statement string;
declare flatten_statement string;
set flatten_statement = "select "
|| "event_name"
|| ", event_date"
|| ", event_timestamp"
|| ", TIMESTAMP_MICROS(event_timestamp) as event_datetime"
|| ", event_previous_timestamp"
|| ", event_value_in_usd"
|| ", event_bundle_sequence_id"
|| ", event_server_timestamp_offset"
|| ", user_id"
|| ", user_pseudo_id" ;
for iterator in (with events as (
SELECT
event_name,
param.key as name,
CASE
WHEN param.value.string_value is not null THEN "string"
WHEN param.value.int_value is not null THEN "int"
WHEN param.value.double_value is not null THEN "double"
WHEN param.value.float_value is not null THEN "float"
END as value
FROM `ga-xx.analytics_xxx.events_*`,
UNNEST(event_params) as param
WHERE _TABLE_SUFFIX between '20240315' and '20240315'
--DATE_sub(current_date(), interval 7 day) and DATE_sub(current_date(), interval 1 day)
GROUP BY event_name, name, value
)
select event_name, dimension.name, dimension.value from (
SELECT
event_name,
ARRAY_AGG(struct(name, value) order by name) as attribute
FROM events
GROUP BY event_name
ORDER BY event_name
), unnest(attribute) as dimension)
do
-- IF( iterator.value== "string" , 'true', 'false') AS result
-- CASE
-- WHEN "string" THEN "(SELECT value.string_value "
-- WHEN "int" THEN "(SELECT value.int_value "
-- WHEN "double" THEN "(SELECT value.double_value "
-- WHEN "float" THEN "(SELECT value.float_value "
-- END as value
IF (iterator.value = "string") THEN set value_type_statement = ' (SELECT value.string_value '; end if;
IF (iterator.value = "int") THEN set value_type_statement = ' (SELECT value.int_value '; end if;
IF (iterator.value = "double") THEN set value_type_statement = ' (SELECT value.double_value '; end if;
IF (iterator.value = "float") THEN set value_type_statement = ' (SELECT value.float_value '; end if;
set flatten_statement = flatten_statement
|| value_type_statement
|| " FROM UNNEST(event_params) WHERE key = " || iterator.name || "AS" || iterator.name;
end for;
set flatten_statement = flatten_statement;
end;
--call `ga-xx.sandbox_xx`.proc_app ();
To build a query string dynamically within a BigQuery stored procedure, especially when you need to handle different data types with conditional logic, you have to piece the SQL string together by checking each condition and appending the appropriate fragment based on the outcome. You were trying to use a CASE block as a standalone statement inside the loop, but BigQuery scripting doesn't support that: CASE is an expression, not a statement, so it can only appear inside an assignment (SET ... = CASE ... END) or inside the query string itself. You can either use IF statements, as you started to, or assign the result of a CASE expression directly to your variable.
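For illustration, here is a minimal standalone sketch (the variable name and fragments are made up) showing both patterns side by side:
DECLARE value_type STRING DEFAULT 'int';
DECLARE fragment STRING;
-- Option 1: IF statements
IF value_type = 'string' THEN SET fragment = ' (SELECT value.string_value '; END IF;
IF value_type = 'int' THEN SET fragment = ' (SELECT value.int_value '; END IF;
-- Option 2: assign the result of a CASE expression (CASE on its own is not a statement)
SET fragment = CASE value_type
WHEN 'string' THEN ' (SELECT value.string_value '
WHEN 'int' THEN ' (SELECT value.int_value '
ELSE ' (SELECT value.string_value '
END;
SELECT fragment;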
Here's a revised approach that builds your flatten_statement string within the loop, assigning the type fragment with a single CASE expression for clarity:
CREATE OR REPLACE PROCEDURE `ga-xxx.sandbox_x`.proc_app()
BEGIN
DECLARE value_type_statement STRING DEFAULT '';
DECLARE flatten_statement STRING DEFAULT 'SELECT
event_name,
event_date,
event_timestamp,
TIMESTAMP_MICROS(event_timestamp) AS event_datetime,
event_previous_timestamp,
event_value_in_usd,
event_bundle_sequence_id,
event_server_timestamp_offset,
user_id,
user_pseudo_id';
-- Assuming you want to append to the flatten_statement for each parameter
FOR iterator IN (
WITH events AS (
SELECT
event_name,
param.key AS name,
CASE
WHEN param.value.string_value IS NOT NULL THEN 'string'
WHEN param.value.int_value IS NOT NULL THEN 'int'
WHEN param.value.double_value IS NOT NULL THEN 'double'
WHEN param.value.float_value IS NOT NULL THEN 'float'
END AS value
FROM `ga-xx.analytics_xxx.events_*`,
UNNEST(event_params) AS param
WHERE _TABLE_SUFFIX BETWEEN '20240315' AND '20240315'
GROUP BY event_name, name, value
)
SELECT event_name, dimension.name, dimension.value
FROM ( SELECT event_name, ARRAY_AGG(STRUCT(name, value) ORDER BY name) AS attribute
FROM events
GROUP BY event_name
ORDER BY event_name ),
UNNEST(attribute) AS dimension
) DO
SET value_type_statement = CASE iterator.value
-- Map to the field-name prefix used inside event_params (string_value, int_value, double_value, float_value)
WHEN 'string' THEN 'string'
WHEN 'int' THEN 'int'
WHEN 'double' THEN 'double'
WHEN 'float' THEN 'float'
ELSE 'string' -- Default if the type could not be determined
END;
-- Append to the flatten_statement for each parameter
SET flatten_statement = CONCAT(flatten_statement,
', (SELECT value.', value_type_statement, '_value
FROM UNNEST(event_params)
WHERE key = ''', iterator.name, ''') AS ', iterator.name);
END FOR;
-- Debug: Output the constructed query (remove in production)
SELECT flatten_statement;
END;
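If you want to test it, a hypothetical invocation (the procedure takes no arguments; its final SELECT simply prints the generated SQL for inspection):
CALL `ga-xxx.sandbox_x`.proc_app();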
Hi @ms4446 Thank you so much for replying. I got it fixed before I saw your message, but I'll definitely test your code snippet. However, it might not be the quickest query; is there any way I could optimise it? I will also modify it to give the variables default values when declaring them. Thanks again, have a nice weekend.
create or replace procedure `ga-xx.sandbox_xx`.proc_app (startDate STRING, endDate STRING, OUT sql STRING)
begin
declare value_type_statement string;
declare user_prop_statement string;
declare flatten_statement string;
set flatten_statement = "WITH analytics AS ("
|| "SELECT "
|| "event_name"
|| ", event_date"
|| ", event_timestamp"
|| ", TIMESTAMP_MICROS(event_timestamp) as event_datetime"
|| ", event_previous_timestamp"
|| ", event_value_in_usd"
|| ", event_bundle_sequence_id"
|| ", event_server_timestamp_offset"
|| ", user_id"
|| ", user_pseudo_id" ;
for iterator in (with events as (
SELECT
DISTINCT param.key as name,
CASE
WHEN param.value.string_value is not null THEN ', (SELECT value.string_value ' --"string"
WHEN param.value.int_value is not null THEN ', (SELECT value.int_value ' --"int"
WHEN param.value.double_value is not null THEN ', (SELECT value.double_value ' --"double"
WHEN param.value.float_value is not null THEN ', (SELECT value.float_value ' --"float"
END as value
FROM `ga-xx.analytics_xx.events_intraday_*`,
UNNEST(event_params) as param
WHERE _TABLE_SUFFIX between startDate and endDate
--DATE_sub(current_date(), interval 7 day) and DATE_sub(current_date(), interval 1 day)
GROUP BY name,value
)
select dimension.name, dimension.value from (
SELECT
ARRAY_AGG(struct(name, value) order by name) as attribute
FROM events
GROUP BY name
), unnest(attribute) as dimension
LIMIT 5)
do
-- IF (iterator.value = "string") THEN set value_type_statement = ', (SELECT value.string_value '; end if;
-- IF (iterator.value = "int") THEN set value_type_statement = ', (SELECT value.int_value '; end if;
-- IF (iterator.value = "double") THEN set value_type_statement = ', (SELECT value.double_value '; end if;
-- IF (iterator.value = "float") THEN set value_type_statement = ', (SELECT value.float_value '; end if;
set flatten_statement = flatten_statement
--value_type_statement
|| iterator.value
|| " FROM UNNEST(event_params) WHERE key = '" || iterator.name || "') AS " || iterator.name;
-- , (SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'KGVideoElapsedSeconds') AS KGVideoElapsedSeconds
end for;
for iterator_userprop in (with user_props as (
select
DISTINCT user_prop.key as user_prop_key,
case when user_prop.value.string_value is not null then ', (SELECT value.string_value '
when user_prop.value.int_value is not null then ', (SELECT value.int_value ' --"int"
when user_prop.value.double_value is not null then ', (SELECT value.double_value ' --"double"
when user_prop.value.float_value is not null then ', (SELECT value.float_value ' --"float"
else null end as user_prop_value
from
`ga-xx.analytics_xx.events_intraday_*`,
unnest(user_properties) as user_prop
group by
user_prop_key,
user_prop_value
order by
user_prop_key
)
select dimension.user_prop_key, dimension.user_prop_value from (
SELECT
ARRAY_AGG(struct(user_prop_key, user_prop_value) order by user_prop_key) as up_attribute
FROM user_props
GROUP BY user_prop_key
ORDER BY user_prop_key
), unnest(up_attribute) as dimension)
do
set flatten_statement = flatten_statement
|| iterator_userprop.user_prop_value
|| " FROM UNNEST(user_properties) WHERE key = '" || iterator_userprop.user_prop_key || "') AS " || iterator_userprop.user_prop_key;
end for;
set flatten_statement = flatten_statement
|| " , privacy_info.analytics_storage AS analytics_storage"
|| " , privacy_info.ads_storage AS ads_storage"
|| " , privacy_info.uses_transient_token AS uses_transient_token"
|| " , user_first_touch_timestamp "
|| " , user_ltv.revenue AS user_ltv_revenue"
|| " , user_ltv.currency AS user_ltv_currency"
|| " FROM `ga-xx.analytics_xx.events_intraday_*` "
|| " WHERE _TABLE_SUFFIX BETWEEN '"||startDate||"' AND '"||endDate||"' "
|| " ) "
|| " SELECT * FROM analytics;";
--EXECUTE IMMEDIATE (flatten_statement);
set sql = flatten_statement;
end;
--DECLARE sql STRING;
--CALL `ga-xx.sandbox_xx.proc_xx_app`('20240315','20240315',sql);
--SELECT sql;
Here are several strategies to enhance performance and maintainability:
1. Minimize Dynamic SQL Construction Complexity
2. Use Pre-aggregated Metadata
3. Optimize Iteration Logic
4. Leverage BigQuery Functions
5. Streamline SQL String Concatenation
Without altering the core logic of your procedure significantly, here are a few adjustments for potential optimizations:
-- Example of a simplified approach (conceptual, adjust as needed):
CREATE OR REPLACE PROCEDURE `ga-xx.sandbox_xx`.proc_app(startDate STRING, endDate STRING, OUT sql STRING)
BEGIN
-- Base SQL statement
DECLARE base_sql STRING DEFAULT "WITH analytics AS (SELECT event_name, ... FROM `ga-xx.analytics_xx.events_intraday_*` WHERE _TABLE_SUFFIX BETWEEN '" || startDate || "' AND '" || endDate || "') SELECT * FROM analytics;";
-- Pre-aggregated metadata tables (param_meta, user_prop_meta)
-- Default to empty arrays so ARRAY_CONCAT below doesn't return NULL
DECLARE params_sql ARRAY<STRING> DEFAULT [];
DECLARE user_props_sql ARRAY<STRING> DEFAULT [];
-- Construct SQL fragments (optimized)
FOR param IN (SELECT extract_query FROM `ga-xx.sandbox_xx.param_meta`) DO
SET params_sql = ARRAY_CONCAT(params_sql, [param.extract_query]);
END FOR;
-- ... (similarly for user_props_sql)
-- Combine into final SQL (ARRAY_TO_STRING joins the collected fragments)
SET sql = CONCAT(base_sql, ARRAY_TO_STRING(params_sql, ', '), ARRAY_TO_STRING(user_props_sql, ', '));
END;
Thanks @ms4446, it's so helpful!
I have a tiny question here. I pass the project id, dataset name, table name, and date range into the stored procedure, and I get an error: invalid value: Table "full_table_name" must be qualified with a dataset (e.g. dataset.table). Something is wrong with the full_table_name variable and I couldn't make it work.
create or replace procedure `ga-xx.sandbox_xx`.proc_xapp (project_id STRING, dataset_name STRING, table_name STRING, startDate STRING, endDate STRING, OUT sql STRING)
begin
-- declare row_count INT64;
declare full_table_name string;
declare value_type_statement string;
declare user_prop_statement string;
declare flatten_statement string DEFAULT "WITH analytics AS ("
|| "SELECT "
|| "event_name \n"
|| ", event_date \n"
|| ", event_timestamp \n"
|| ", TIMESTAMP_MICROS(event_timestamp) as event_datetime \n"
|| ", event_previous_timestamp \n"
|| ", event_value_in_usd \n"
|| ", event_bundle_sequence_id \n"
|| ", event_server_timestamp_offset \n"
|| ", user_id \n"
|| ", user_pseudo_id \n";
set full_table_name = '`'||project_id||'.'||dataset_name||'.'||table_name||'`';
for iterator in (with events as (
SELECT
DISTINCT param.key as name,
CASE
WHEN param.value.string_value is not null THEN ', (SELECT value.string_value ' --"string"
WHEN param.value.int_value is not null THEN ', (SELECT value.int_value ' --"int"
WHEN param.value.double_value is not null THEN ', (SELECT value.double_value ' --"double"
WHEN param.value.float_value is not null THEN ', (SELECT value.float_value ' --"float"
END as value
FROM full_table_name, UNNEST(event_params) as param
WHERE _TABLE_SUFFIX between startDate and endDate
GROUP BY name,value
)
select dimension.name, dimension.value from (
SELECT
ARRAY_AGG(struct(name, value) order by name) as attribute
FROM events
GROUP BY name
), unnest(attribute) as dimension
LIMIT 5)
do
-- IF (iterator.value = "string") THEN set value_type_statement = ', (SELECT value.string_value '; end if;
-- IF (iterator.value = "int") THEN set value_type_statement = ', (SELECT value.int_value '; end if;
-- IF (iterator.value = "double") THEN set value_type_statement = ', (SELECT value.double_value '; end if;
-- IF (iterator.value = "float") THEN set value_type_statement = ', (SELECT value.float_value '; end if;
set flatten_statement = flatten_statement
|| iterator.value
|| " FROM UNNEST(event_params) WHERE key = '" || iterator.name || "') AS " || iterator.name||" \n";
end for;
for iterator_userprop in (with user_props as (
select
DISTINCT user_prop.key as user_prop_key,
case when user_prop.value.string_value is not null then ', (SELECT value.string_value ' --"string"
when user_prop.value.int_value is not null then ', (SELECT value.int_value ' --"int"
when user_prop.value.double_value is not null then ', (SELECT value.double_value ' --"double"
when user_prop.value.float_value is not null then ', (SELECT value.float_value ' --"float"
else null
end as user_prop_value
from
`ga-xx.analytics_xx.events_intraday_*`,unnest(user_properties) as user_prop
WHERE _TABLE_SUFFIX between startDate and endDate
group by
user_prop_key,
user_prop_value
order by
user_prop_key
)
select dimension.user_prop_key, dimension.user_prop_value from (
SELECT
ARRAY_AGG(struct(user_prop_key, user_prop_value) order by user_prop_key) as up_attribute
FROM user_props
GROUP BY user_prop_key
ORDER BY user_prop_key
), unnest(up_attribute) as dimension)
do
set flatten_statement = flatten_statement
|| iterator_userprop.user_prop_value
|| " FROM UNNEST(user_properties) WHERE key = '" || iterator_userprop.user_prop_key || "') AS " || iterator_userprop.user_prop_key||"\n";
end for;
set flatten_statement = flatten_statement
|| " FROM `ga-xx.analytics_xx.events_intraday_*` "
|| " WHERE _TABLE_SUFFIX BETWEEN '"||startDate||"' AND '"||endDate||"' "
|| " ) "
|| " SELECT * FROM analytics;";
--EXECUTE IMMEDIATE (flatten_statement);
set sql = flatten_statement;
end;
DECLARE sql STRING;
CALL `ga-xx.sandbox_yy.proc_app`('ga-xx','analytics_zz','events_intraday_*','20240319','20240319',sql);
SELECT sql;
EXECUTE IMMEDIATE sql;
Once it passes the test, I'll start optimising the code. Thanks for helping.
Hi @sysph ,
The error you're encountering occurs because BigQuery doesn't support directly inserting variables like full_table_name into the FROM clause of a standard SQL query. This is a security and optimization feature.
To work around this, you need to build the table reference into the query string itself and run the finished string with EXECUTE IMMEDIATE:
CREATE OR REPLACE PROCEDURE `ga-xx.sandbox_xx`.proc_app (
project_id STRING,
dataset_name STRING,
table_name STRING,
startDate STRING,
endDate STRING,
OUT sql STRING
)
BEGIN
-- Declare variables
DECLARE full_table_name STRING;
DECLARE query_statement STRING;
-- Construct the full table name
SET full_table_name = CONCAT('`', project_id, '.', dataset_name, '.', table_name, '`');
-- Build the dynamic SQL statement
SET query_statement = CONCAT(
"WITH analytics AS (",
"SELECT event_name, event_date, event_timestamp, ",
" TIMESTAMP_MICROS(event_timestamp) as event_datetime, ",
" event_previous_timestamp, event_value_in_usd, event_bundle_sequence_id, ",
" event_server_timestamp_offset, user_id, user_pseudo_id ",
"FROM ", full_table_name, " ",
"WHERE _TABLE_SUFFIX BETWEEN '", startDate, "' AND '", endDate, "'"
")"
);
-- Add any further dynamic parts to the query as needed
-- Finalize the SQL statement
SET query_statement = CONCAT(query_statement, " SELECT * FROM analytics;");
-- Return the constructed SQL statement
SET sql = query_statement;
END;
How to Execute:
DECLARE sql STRING;
CALL `ga-xx.sandbox_xx`.proc_app(
'ga-xx',
'analytics_zz',
'events_intraday_*',
'20240319',
'20240319',
sql
);
EXECUTE IMMEDIATE sql;
Apologies, the code snippet above contains a repeated query.
I have another question. I use the code snippet below to fetch the sub-columns (the other nested and un-nested columns). Some of the field names don't exist or don't match in the GA4 table (for app). Why is that? What would be the best way to retrieve those fields? Thanks
FOR iterator_nonnested in (WITH table_columns AS (
SELECT
DISTINCT column_name, field_path, data_type
FROM
`ga-xx`.analytics_yy.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
-- WHERE
-- table_name = 'events_'||startDate
)
SELECT *
FROM table_columns
WHERE column_name NOT IN ('event_name', 'event_date', 'event_timestamp','event_previous_timestamp', 'event_value_in_usd','event_bundle_sequence_id','event_server_timestamp_offset','user_id','user_pseudo_id','pseudo_user_id','user_info','event_params','user_properties','items','audiences','predictions','occurrence_date','last_updated_date')
AND field_path NOT IN ('device.unified_screen_name','user_ltv.revenue_in_usd','user_ltv.sessions','user_ltv.engagement_time_millis','user_ltv.purchases','user_ltv.engaged_sessions','user_ltv.session_duration_micros','privacy_info.is_limited_ad_tracking','privacy_info.is_ads_personalization_allowed'))
DO
-- if column_name != field_path then directly add to flatten_satement
IF (iterator_nonnested.column_name != iterator_nonnested.field_path) THEN
set flatten_statement = flatten_statement
|| ", " || iterator_nonnested.field_path || " AS " || REPLACE( iterator_nonnested.field_path, '.', '_') || " \n";
END IF;
-- if column_name = field_path and count =1 then add it to flatten_statement eg. is_active_user only 1
IF (iterator_nonnested.column_name = iterator_nonnested.field_path) THEN
for iterator_column_count in (WITH table_column_count AS (
SELECT
DISTINCT column_name, field_path, data_type
FROM
`ga-xx`.analytics_yy.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
-- WHERE
-- table_name = 'events_'||startDate
)
SELECT count(*) AS column_name_count,column_name
FROM table_column_count
WHERE column_name NOT IN ('event_name', 'event_date', 'event_timestamp','event_previous_timestamp', 'event_value_in_usd','event_bundle_sequence_id','event_server_timestamp_offset','user_id','user_pseudo_id','pseudo_user_id','user_info','event_params','user_properties','items','audiences','predictions','occurrence_date','last_updated_date')
AND field_path NOT IN ('device.unified_screen_name','user_ltv.revenue_in_usd','user_ltv.sessions','user_ltv.engagement_time_millis','user_ltv.purchases','user_ltv.engaged_sessions','user_ltv.session_duration_micros','privacy_info.is_limited_ad_tracking','privacy_info.is_ads_personalization_allowed')
group by column_name
having column_name = iterator_nonnested.column_name)
do
IF (iterator_column_count.column_name_count= 1) THEN
set flatten_statement = flatten_statement || ", " || iterator_nonnested.field_path || " \n";
END IF;
end for;
END IF;
END FOR;
Hi @sysph ,
You've encountered a common challenge: not all fields you might expect from the GA4 schema are present in every dataset. This discrepancy can occur due to several reasons:
Custom Events and Parameters: GA4 allows for custom events and parameters, which means not all datasets will have the same schema. Some fields might be present in one dataset but absent in another due to the custom nature of event tracking set up by different users.
Schema Evolution: The schema for GA4 data in BigQuery can evolve over time as new features are added to GA4 and as Google modifies the data export format. This can lead to situations where newer datasets have fields that older datasets do not.
Export Configurations: The configuration of the GA4 property and its data export to BigQuery might affect which fields are available. Some fields might be omitted based on user settings or the specific events being tracked.
To handle these discrepancies and efficiently retrieve available fields, consider the following strategies:
Dynamic Schema Discovery: Your current approach of using INFORMATION_SCHEMA.COLUMN_FIELD_PATHS for dynamic schema discovery is on the right track. This allows you to adapt to the available schema dynamically.
Graceful Handling of Missing Fields: When constructing your query, it might be beneficial to check whether a field exists in the dataset before attempting to include it in your query. This can be challenging to do directly in SQL, but consider preprocessing steps that filter out non-existent fields based on your dynamic schema discovery (see the sketch after this list).
Fallback Mechanisms: For critical fields that might be missing, consider implementing fallback mechanisms where possible. For example, if a specific user property is not available, you might be able to infer or approximate its value from other available data.
Schema Versioning: Keep track of schema changes over time. If you're working with historical data, understanding when certain fields were introduced or deprecated can help you adjust your queries accordingly.
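To illustrate the "Graceful Handling of Missing Fields" point, here is a minimal sketch (the candidate field list is made up) that keeps only the field paths that actually exist in the export before you build the dynamic SELECT:
DECLARE wanted_fields ARRAY<STRING> DEFAULT ['device.unified_screen_name', 'user_ltv.revenue', 'privacy_info.analytics_storage'];
-- Keep only the candidates that INFORMATION_SCHEMA actually reports for the events tables
SELECT DISTINCT field_path
FROM `ga-xx`.analytics_yy.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name LIKE 'events_%'
AND field_path IN UNNEST(wanted_fields);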
Here's my final stored procedure to generate the flattened query, and it's faster than the previous one that used lots of FOR loops. Please let me know if there's any better way to improve this. Thanks in advance
CREATE OR REPLACE PROCEDURE `ga-xx.dataset`.proc_app_optimised (project_id STRING, dataset_name STRING, table_name STRING, startDate STRING, endDate STRING, OUT sql STRING)
BEGIN
DECLARE table_name_argument STRING;
DECLARE full_table_name string;
-- Base SQL statement
DECLARE flatten_statement string DEFAULT "WITH analytics AS ("
|| "SELECT "
|| "event_name \n"
|| ", event_date \n"
|| ", event_timestamp \n"
|| ", TIMESTAMP_MICROS(event_timestamp) as event_datetime \n"
|| ", event_previous_timestamp \n"
|| ", event_value_in_usd \n"
|| ", event_bundle_sequence_id \n"
|| ", event_server_timestamp_offset \n"
|| ", user_id \n"
|| ", user_pseudo_id \n";
-- Pre-aggregated metadata tables (param_meta, user_prop_meta)
DECLARE event_params_sql string;
DECLARE user_props_sql string;
DECLARE other_unnested_columns_sql string;
--DECLARE params_sql ARRAY<STRING>;
-- DECLARE user_props_sql ARRAY<STRING>;
-- Construct the full table name
SET full_table_name = CONCAT(project_id, '.', dataset_name, '.', table_name);
-- Construct all event parameters
EXECUTE IMMEDIATE format(r"""
with events as (
SELECT
DISTINCT param.key as name,
CASE
WHEN param.value.string_value is not null THEN ' (SELECT value.string_value '
WHEN param.value.int_value is not null THEN ' (SELECT value.int_value '
WHEN param.value.double_value is not null THEN ' (SELECT value.double_value '
WHEN param.value.float_value is not null THEN ' (SELECT value.float_value '
END as value
FROM `%s`,UNNEST(event_params) as param
WHERE _TABLE_SUFFIX between '%s' and '%s'
GROUP BY name,value)
SELECT STRING_AGG(value|| "FROM UNNEST(event_params) WHERE key = '" || name || "') AS " || name || '\n' ORDER BY name)
FROM events
"""
,full_table_name,startDate, endDate) INTO event_params_sql;
set flatten_statement = flatten_statement || ',' || event_params_sql;
--Construct all user properties
EXECUTE IMMEDIATE format(r"""
with user_props as (
select
DISTINCT user_prop.key as user_prop_key,
case when user_prop.value.string_value is not null then ' (SELECT value.string_value '
when user_prop.value.int_value is not null then ' (SELECT value.int_value '
when user_prop.value.double_value is not null then ' (SELECT value.double_value '
when user_prop.value.float_value is not null then ' (SELECT value.float_value '
else null
end as user_prop_value
from `%s`,unnest(user_properties) as user_prop
WHERE _TABLE_SUFFIX between '%s' and '%s'
group by
user_prop_key,
user_prop_value
order by
user_prop_key
)
SELECT STRING_AGG(user_prop_value|| "FROM UNNEST(user_properties) WHERE key = '" || user_prop_key || "') AS " || user_prop_key || '\n' ORDER BY user_prop_key)
FROM user_props
"""
,full_table_name,startDate, endDate) INTO user_props_sql;
set flatten_statement = flatten_statement || ',' || user_props_sql;
IF (table_name = 'events_intraday_*') THEN
set table_name_argument = 'events_intraday';
ELSE
set table_name_argument = 'events';
END IF;
--Construct all other columns
EXECUTE IMMEDIATE format(r"""
WITH table_columns AS (
SELECT DISTINCT column_name, field_path, data_type
FROM `%s.%s.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_catalog = '%s'
AND table_schema = '%s'
AND table_name BETWEEN '%s_%s' and '%s_%s'
AND data_type IN ('STRING','INT64','FLOAT64','NUMERIC','TIMESTAMP','DATE','TIME','DATETIME','BOOL','BIGNUMERIC','ARRAY')
AND column_name NOT IN ('event_name', 'event_date', 'event_timestamp','event_previous_timestamp', 'event_value_in_usd','event_bundle_sequence_id','event_server_timestamp_offset','user_id','user_pseudo_id','pseudo_user_id','event_params','user_properties','items'))
SELECT STRING_AGG(field_path || " AS " || REPLACE(field_path, '.', '_') || ' \n' ORDER BY field_path) FROM table_columns
"""
,project_id,dataset_name,project_id,dataset_name,table_name_argument,startDate,table_name_argument,endDate) INTO other_unnested_columns_sql;
set flatten_statement = flatten_statement || ',' || other_unnested_columns_sql;
set flatten_statement = flatten_statement
|| " FROM `"||full_table_name||"` \n"
|| " WHERE _TABLE_SUFFIX BETWEEN '"||startDate||"' AND '"||endDate||"' \n"
|| " ) \n"
|| " SELECT * FROM analytics;";
set sql = flatten_statement;
end;
-- DECLARE sql STRING;
-- CALL `ga-xx.sandbox_xx.proc_app_optimise`('ga-xx','analytics_xx','events_*','20240327','20240328',sql);
-- SELECT sql;
-- EXECUTE IMMEDIATE (sql);
Great job on updating your stored procedure. The improvements you've made, particularly moving away from a loop-heavy approach, should enhance both performance and maintainability.
Some Observations
Your use of EXECUTE IMMEDIATE in conjunction with FORMAT() effectively streamlines the construction of the subqueries. This strategy handles the variations in event parameters, user properties, and other columns, and plays a pivotal role in the performance improvement you observed.
Limiting the included columns by data type is also a sensible optimization: by trimming the query's output you reduce processing overhead and make downstream data handling easier.
Here are some suggestions for enhancement:
Parameter Validation: Strengthening the stored procedure's reliability through input validation for parameters such as project_id, dataset_name, table_name, startDate, and endDate is essential. Checks for non-null values, valid date formats, and logical date ordering can prevent common issues.
Error Handling: Adopting BigQuery's BEGIN...EXCEPTION...END structure allows for more sophisticated error management around your dynamic SQL. This is vital for easing debugging and maintenance, particularly in production environments (a sketch covering these first two points follows this list).
Exploring ARRAY_AGG: Although STRING_AGG currently fulfills your requirements, ARRAY_AGG might provide benefits by retaining data type information within your query's output. This could be particularly advantageous for analyses that depend on preserving the original data types.
Strategic Use of Temporary Tables: If your final query often references the same intermediate results, employing temporary tables to materialize these results could enhance performance. Benchmarking with realistic datasets will inform this decision.
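As a minimal sketch of the first two points (the validation rules and demo values are illustrative, not a definitive implementation):
-- Standalone demo: the dates are deliberately out of order so the RAISE fires and is caught
DECLARE startDate STRING DEFAULT '20240327';
DECLARE endDate STRING DEFAULT '20240326';
BEGIN
  -- Validate inputs before building any dynamic SQL
  IF startDate IS NULL OR NOT REGEXP_CONTAINS(startDate, r'^\d{8}$')
     OR endDate IS NULL OR NOT REGEXP_CONTAINS(endDate, r'^\d{8}$') THEN
    RAISE USING MESSAGE = 'startDate and endDate must be non-null YYYYMMDD strings';
  END IF;
  IF startDate > endDate THEN
    RAISE USING MESSAGE = 'startDate must not be after endDate';
  END IF;
  -- ... build flatten_statement and EXECUTE IMMEDIATE it here ...
EXCEPTION WHEN ERROR THEN
  -- Surface the error (and the failing statement, if any) for easier debugging
  SELECT @@error.message AS error_message, @@error.statement_text AS failing_statement;
END;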
Enhancing ARRAY_AGG Usage
The example you provided for utilizing ARRAY_AGG to preserve type information within the flattened output is insightful. It underscores how maintaining the original data types can enrich subsequent analyses.
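For anyone following along, here is a minimal sketch of what an ARRAY_AGG-based variant might look like: instead of flattening each parameter into its own string column, the parameters are kept as an array of typed structs per event (the dataset name and date are placeholders from this thread):
SELECT
  event_name,
  event_timestamp,
  user_pseudo_id,
  -- Keep each parameter with its original typed value fields instead of casting everything to STRING
  ARRAY_AGG(STRUCT(param.key, param.value.string_value, param.value.int_value, param.value.double_value) ORDER BY param.key) AS params
FROM `ga-xx.analytics_xx.events_*`, UNNEST(event_params) AS param
WHERE _TABLE_SUFFIX BETWEEN '20240315' AND '20240315'
GROUP BY event_name, event_timestamp, user_pseudo_id;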