Hi,
My issue is that the weekly rebuild of the dataform pipeline results in changes to overall total volumes from historical data.
Every Sunday there is a complete rebuild of the pipeline to integrate any new business dimensions that have been added that week. So the underlying metrics should be unchanged from say December, because the source data for that hasn't changed.
What I see is a slight variation in volumes for certain groupings, after each rebuild. So the total volumes may differ by 2-3% week to week when looking at the same date range.
I am wondering if there is an reason for this in BigQuery - is there any sampling applied which means not all rows are always included on a large dataset.
Thanks
Hi @thd111,
Welcome to Google Cloud Community!
Your Dataform pipeline's weekly rebuilds show a 2-3% variation in total volumes, even for unchanged historical data. This isn't likely due to BigQuery sampling. The problem is most probably within your Dataform pipeline itself. Focus your debugging on:
Take note: To troubleshoot, isolate the problem area, compare data from different weeks, test with smaller datasets, and add more logging and assertions to your Dataform code. The issue is almost certainly a coding error, not a BigQuery problem.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.