
Google Analytics 4 RAW Data Export to GCS: is this approach valid?

Hi,

Goal: Export Google Analytics 4 events data to a GCS bucket on a daily basis.

Approach: Export table data from BigQuery to GCS using the EXPORT DATA statement.

Example:

EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.csv',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';'
  )
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);

Since the GA4 events data is nested, the actual file format would be Parquet or Avro rather than CSV.
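For illustration, a minimal sketch of the same statement with Parquet output (the bucket path, dataset, and table names are placeholders):

EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/ga4/events_20230214/*.parquet',
    format = 'PARQUET',
    overwrite = true
  )
AS (
  -- Nested RECORD and REPEATED fields such as event_params are preserved in Parquet.
  SELECT *
  FROM `project.analytics_123456.events_20230214`
);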

Problem: As per the official [GA4] BigQuery Export schema documentation for the events table: "Not all devices on which events are triggered send their data to Analytics on the same day the events are triggered. To account for this latency, Analytics will update the daily tables (events_YYYYMMDD) with events for those dates for up to three days after the dates of the events. Events will have the correct time stamp regardless of arriving late. Events that arrive after that three-day window are not recorded."

I've verified this in our Analytics dataset: as of today, 15 Feb 2023, the tables events_20230214, events_20230213, and events_20230212 all show a last-modified date of 15 Feb 2023, so GA is still appending data to the previous three days' event tables. To schedule a job that exports data from BigQuery to GCS in APPEND mode into the respective date-wise folders, I am thinking of using BigQuery's change history feature, scanning only the data INSERTED on 15 Feb 2023:

SELECT
  *,
  DATE(_CHANGE_TIMESTAMP) AS change_date
FROM
  APPENDS(TABLE `project.dataset.events_20230214`,
          '2023-02-15 00:00:00.000000 UTC',
          '2023-02-15 23:59:59.999999 UTC')
UNION ALL
SELECT
  *,
  DATE(_CHANGE_TIMESTAMP) AS change_date
FROM
  APPENDS(TABLE `project.dataset.events_20230213`,
          '2023-02-15 00:00:00.000000 UTC',
          '2023-02-15 23:59:59.999999 UTC')
UNION ALL
SELECT
  *,
  DATE(_CHANGE_TIMESTAMP) AS change_date
FROM
  APPENDS(TABLE `project.dataset.events_20230212`,
          '2023-02-15 00:00:00.000000 UTC',
          '2023-02-15 23:59:59.999999 UTC')

The GCS URIs would look like the following (a concrete export sketch follows the list):

gs://bucket/table=events_20230212/change_date=20230215/*

gs://bucket/table=events_20230212/change_date=20230214/*

gs://bucket/table=events_20230212/change_date=20230213/*
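To make this concrete, a minimal sketch of one such scheduled export (project, dataset, bucket, and dates are placeholders; the real job would template the table name and run date):

-- Export only the rows appended to events_20230212 on 2023-02-15
-- into that table's change_date=20230215 folder.
EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/table=events_20230212/change_date=20230215/*.parquet',
    format = 'PARQUET',
    overwrite = false  -- each run writes to a new change_date folder
  )
AS (
  SELECT
    * EXCEPT (_CHANGE_TYPE, _CHANGE_TIMESTAMP)
  FROM
    APPENDS(TABLE `project.dataset.events_20230212`,
            '2023-02-15 00:00:00.000000 UTC',
            '2023-02-15 23:59:59.999999 UTC')
);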

 

Question: Is the above approach the right way to move ahead, or is there a better alternative?

Thanks


Hi @darsh_shukla 

If I understand your goals and questions correctly, your approach is the right way to move ahead. You are using BigQuery's change history feature to identify the data inserted on February 15, 2023, and then exporting that data to Google Cloud Storage. This ensures you export only the new data and do not overwrite any existing data.

Here are some other things to keep in mind when exporting Google Analytics 4 data to Google Cloud Storage:

  • You can export data to Google Cloud Storage in batch mode, which lets you move a larger amount of data at once.
  • You can export data in near real time, as it is being generated by Google Analytics 4.
  • You can choose the export format: CSV, Avro, or Parquet.
  • You can choose the export location: a specific bucket or folder in Google Cloud Storage (a minimal sketch follows this list).
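For example, a minimal sketch of a batch export that picks both the format and the destination (bucket and table names are placeholders):

EXPORT DATA
  OPTIONS (
    uri = 'gs://my-bucket/ga4-export/*.avro',  -- destination bucket and folder
    format = 'AVRO',                           -- or 'CSV' / 'PARQUET'
    overwrite = true
  )
AS (
  SELECT * FROM `project.dataset.events_20230214`
);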

I hope this helps!

@darsh_shukla thanks for the observation and the proposed solution. We have a framework that, given a date in the format 'YYYYMMDD', generates an export query, and I have generated a query something like the one below. But I am getting an error on line 8 saying that is not the signature for the APPENDS function.

WITH
changed_records AS (
  SELECT
    *,
    DATE(_CHANGE_TIMESTAMP) AS change_date
  FROM
    APPENDS(TABLE `project.dataset.events_20240727`,
            FORMAT_TIMESTAMP('%FT%T.0000 UTC', TIMESTAMP(PARSE_DATE('%Y%m%d', '20240801'), 'UTC')),  -- line 8: the call the error points at
            '2024-08-01T23:59:59.999999 UTC')
)

SELECT event_date, change_date, COUNT(*)
FROM changed_records
GROUP BY 1, 2
ORDER BY change_date
LIMIT 10;
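For reference, one variant I am considering, on the assumption that the error comes from FORMAT_TIMESTAMP returning a STRING while APPENDS expects TIMESTAMP arguments (only string literals appear to be implicitly coerced):

WITH
changed_records AS (
  SELECT
    *,
    DATE(_CHANGE_TIMESTAMP) AS change_date
  FROM
    APPENDS(TABLE `project.dataset.events_20240727`,
            -- Assumption: passing TIMESTAMP values directly, instead of
            -- formatted strings, matches the expected APPENDS signature.
            TIMESTAMP(PARSE_DATE('%Y%m%d', '20240801'), 'UTC'),
            TIMESTAMP('2024-08-01 23:59:59.999999 UTC'))
)

SELECT event_date, change_date, COUNT(*)
FROM changed_records
GROUP BY 1, 2
ORDER BY change_date
LIMIT 10;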

 

Any suggestions or workarounds would be greatly appreciated.

Thanks,
Jai