Incremental models using self have the same bytes billed

Hi there,

We are on on-demand billing (billed per bytes processed).

We are finding that when we use the recommended setup for incremental models in Dataform ( WHERE date > (SELECT MAX(date) FROM ${self()}) ), there is no difference in bytes processed between the incremental and non-incremental builds (the time taken is less, but as I mention above, we are billed per bytes, not time). FYI, we are using a unique_key in the model config.

However, when we use a less precise version based on date subtraction (something like WHERE date > DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY) ), there is a big difference. And FYI, we run this model at least once a day, so MAX(date) will effectively be the same as, or more recent than, 1 day ago.

Why is this, or what might we be doing wrong? I would expect the self() subquery to process only a very small number of bytes, since it reads just one field (often the partitioning field), and the amount of data inserted afterwards should be much smaller.

Thanks,

Benji

Solved
1 ACCEPTED SOLUTION

Hi @benjikj-cera,

Welcome to Google Cloud Community!

You are right that the self() function should help reduce processed bytes. It references the previous output of the same model, so you can determine which data still needs to be processed.

WHERE date > (SELECT MAX(date) FROM ${self()})

Based on the condition you've set up, the statement itself is correct: it ensures that only rows whose date is greater than the maximum date currently in the model's output are included. This follows the recommended pattern for handling incremental tables in Dataform.

One reason why ${self()} might process more data than you expect is that the filter value comes from a subquery, which is only resolved while the query runs. Because the filter is not a constant expression, BigQuery cannot prune partitions at query planning time, so it falls back to scanning the entire source table to find the matching rows, and on-demand billing charges for that full scan.
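As an illustration, compare the two filters you tried (source_table and target_table are placeholder names, not from your project):

```sql
-- Filter depends on a subquery: BigQuery cannot resolve it at planning
-- time, so no partitions are pruned and the full source table is billed.
SELECT * FROM source_table
WHERE date > (SELECT MAX(date) FROM target_table);

-- Filter is a constant expression: BigQuery prunes partitions and bills
-- only the partitions that can match.
SELECT * FROM source_table
WHERE date > DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```

This is why your DATE_SUB variant bills far fewer bytes even though both queries return roughly the same rows.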

Additionally, if you have a large dataset that is updated frequently, even small changes in the WHERE clause can significantly influence the amount of data processed.

Consider the following solutions:

  • If your source is a partitioned table, it's best to resolve the checkpoint into a constant expression using a pre_operations block, which avoids a full scan of the partitioned table during each incremental update (see the sketch after this list).
  • Try a more specific unique_key to find the best balance for your WHERE clause.
  • It's also ideal to configure partitionBy and updatePartitionFilter so that the MERGE generated by unique_key only touches specific partitions of the target table (also shown below).
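Here is a minimal sketch of what that can look like in a SQLX file. The table name, column names, and the 3-day update window are illustrative assumptions, not taken from your project:

```sqlx
config {
  type: "incremental",
  uniqueKey: ["id"],
  bigquery: {
    partitionBy: "date",
    // Limits the MERGE's scan of the target table to recent partitions.
    updatePartitionFilter: "date >= date_sub(current_date(), interval 3 day)"
  }
}

pre_operations {
  -- Resolve the checkpoint into a scripting variable first. A variable is
  -- effectively constant when the main query runs, so BigQuery can prune
  -- partitions; a subquery in the WHERE clause cannot be pruned.
  declare date_checkpoint date default (
    ${when(incremental(),
      `select max(date) from ${self()}`,
      `select date('1970-01-01')`)}
  );
}

select id, date, payload
from ${ref("source_table")}
where date > date_checkpoint
```

With this shape, the main query compares against a variable rather than a subquery, so on-demand billing only charges for the source partitions that can match, and updatePartitionFilter keeps the MERGE from scanning the whole target table.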

Note: When you first create and run an incremental table, the initial build has to process the full source, which can involve a significant amount of data, sometimes similar to a non-incremental build.

You can also refer to this Medium article, which provides a detailed walkthrough of creating an incremental table: Modern data pipeline building with BigQuery Dataform — Incremental Tables. Additionally, this related Google Cloud Community question about incremental tables may help.

I hope the above information is helpful.
