Hi,
Goal: To export Google Analytics 4 events data to a GCS bucket on a daily basis.
Approach: Exporting table data from BigQuery to GCS using the EXPORT DATA statement.
Example:
EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.csv',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);
Since the GA4 events data is nested, the actual file format would be either Parquet or Avro rather than CSV.
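For example, a Parquet variant might look roughly like this (the bucket path, project, and dataset names below are placeholders):

EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.parquet',
    format = 'PARQUET',
    overwrite = true)
AS (
  -- Parquet preserves the nested/repeated GA4 fields that CSV cannot represent.
  SELECT *
  FROM `project.dataset.events_20230214`
);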
Problem: As per the official [GA4] BigQuery Export schema documentation, for the events table: "Not all devices on which events are triggered send their data to Analytics on the same day the events are triggered. To account for this latency, Analytics will update the daily tables (events_YYYYMMDD) with events for those dates for up to three days after the dates of the events. Events will have the correct time stamp regardless of arriving late. Events that arrive after that three-day window are not recorded."
I've checked the same in our Analytics dataset: for today, i.e. 15th Feb 2023, the tables events_20230214, events_20230213, and events_20230212 all have a last-modified date of 15th Feb 2023, so GA is still appending data into the previous three days' events tables. In order to schedule a job that exports data from BQ to GCS in APPEND mode into the respective date-wise folders, I am thinking of using the change history feature of BQ, scanning only the data INSERTED on 15th Feb 2023:
SELECT
  *,
  DATE(_CHANGE_TIMESTAMP) AS change_date
FROM
  APPENDS(TABLE `project.dataset.events_20230214`,
          '2023-02-15 00:00:00.000000 UTC',
          '2023-02-15 23:59:59.999999 UTC')
UNION ALL
SELECT
  *,
  DATE(_CHANGE_TIMESTAMP) AS change_date
FROM
  APPENDS(TABLE `project.dataset.events_20230213`,
          '2023-02-15 00:00:00.000000 UTC',
          '2023-02-15 23:59:59.999999 UTC')
UNION ALL
SELECT
  *,
  DATE(_CHANGE_TIMESTAMP) AS change_date
FROM
  APPENDS(TABLE `project.dataset.events_20230212`,
          '2023-02-15 00:00:00.000000 UTC',
          '2023-02-15 23:59:59.999999 UTC')
The GCS URIs will look like:
gs://bucket/table=events_20230212/change_date=20230215/*
gs://bucket/table=events_20230212/change_date=20230214/*
gs://bucket/table=events_20230212/change_date=20230213/*
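Putting the two together, a single day's export for one table might look roughly like the sketch below (table name, dates, and bucket path are placeholders; if EXPORT DATA turns out not to accept APPENDS directly in the query, the APPENDS result could first be written to a staging table):

EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/table=events_20230214/change_date=20230215/*.parquet',
    format = 'PARQUET',
    overwrite = true)
AS (
  -- Only rows appended to this daily table on 15th Feb 2023.
  SELECT *
  FROM APPENDS(TABLE `project.dataset.events_20230214`,
               '2023-02-15 00:00:00.000000 UTC',
               '2023-02-15 23:59:59.999999 UTC')
);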
Question: Is the above approach the right way to move ahead, or is there a better alternative?
Thanks
If I understand your goals and your questions correctly, your approach is the right way to move ahead. You are using the change history feature of BigQuery to identify the data that was inserted on February 15, 2023, and then exporting that data to Google Cloud Storage. This will ensure that you are only exporting the data that is new, and that you are not overwriting any existing data.
Here are some other things to keep in mind when exporting Google Analytics 4 data to Google Cloud Storage:
I hope this helps!
@darsh_shukla thanks for the observation and the proposed solution. We have a framework which, given a date in the format 'YYYYMMDD', generates an export query, and I have generated a query something like the one below. But I am getting an error on line 8 saying that's not the signature for the APPENDS function.
WITH
  changed_records AS (
    SELECT
      *,
      DATE(_CHANGE_TIMESTAMP) AS change_date
    FROM
      APPENDS(TABLE `project.dataset.events_20240727`,
              FORMAT_TIMESTAMP('%FT%T.0000 UTC', TIMESTAMP(PARSE_DATE('%Y%m%d', '20240801'), 'UTC')),
              '2024-08-01T23:59:59.999999 UTC')
  )
SELECT event_date, change_date, COUNT(*)
FROM changed_records
GROUP BY 1, 2
ORDER BY change_date
LIMIT 10;
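My untested guess is that the mismatch is because FORMAT_TIMESTAMP returns a STRING while APPENDS expects TIMESTAMP arguments, so something like the variant below (same placeholder table and dates) may be what is needed:

WITH
  changed_records AS (
    SELECT
      *,
      DATE(_CHANGE_TIMESTAMP) AS change_date
    FROM
      -- Pass TIMESTAMP values directly instead of the STRING produced by
      -- FORMAT_TIMESTAMP (table name and dates are placeholders).
      APPENDS(TABLE `project.dataset.events_20240727`,
              TIMESTAMP(PARSE_DATE('%Y%m%d', '20240801'), 'UTC'),
              TIMESTAMP('2024-08-01 23:59:59.999999 UTC'))
  )
SELECT event_date, change_date, COUNT(*)
FROM changed_records
GROUP BY 1, 2
ORDER BY change_date
LIMIT 10;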
Any suggestions or workarounds would be highly appreciated.
Thanks,
Jai