Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Dataform - differing values after full refresh

Hi,

My issue is that the weekly rebuild of the dataform pipeline results in changes to overall total volumes from historical data. 

Every Sunday there is a complete rebuild of the pipeline to integrate any new business dimensions that have been added that week. So the underlying metrics should be unchanged from say December, because the source data for that hasn't changed.

What I see is a slight variation in volumes for certain groupings, after each rebuild. So the total volumes may differ by 2-3% week to week when looking at the same date range.

I am wondering if there is an reason for this in BigQuery - is there any sampling applied which means not all rows are always included on a large dataset.

 

Thanks

 

0 1 160
1 REPLY 1

Hi @thd111,

Welcome to Google Cloud Community!

Your Dataform pipeline's weekly rebuilds show a 2-3% variation in total volumes, even for unchanged historical data. This isn't likely due to BigQuery sampling. The problem is most probably within your Dataform pipeline itself. Focus your debugging on:

  • Data source schema: Check for subtle changes in data types or structure.
  • SQL transformations: Review your query logic focusing on JOIN’s, aggregations (SUM, COUNT, AVG), and filters (WHERE) for errors. Add assertions to validate intermediate results.
  • Dataform configuration: Ensure settings don't interfere with the full refresh.
  • Data types: Ensure explicit and consistent data types to avoid rounding errors.

Take note: To troubleshoot, isolate the problem area, compare data from different weeks, test with smaller datasets, and add more logging and assertions to your Dataform code. The issue is almost certainly a coding error, not a BigQuery problem.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.