Hi there,
We are on on-demand billing (billed per bytes processed).
We are finding that when we use the recommended setup for incremental models in Dataform ( WHERE date > (SELECT MAX(date) FROM ${self()}) ), there is no difference in bytes processed between the incremental and non-incremental builds. The time taken is less, but as I mentioned above, we are billed per bytes, not time. FYI, we are using a unique_key in the model config.
However, when we use a less precise version based on DATE_SUB (something like WHERE date > DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY) ), there is a big difference. And FYI, we run this model at least every day, so MAX(date) will effectively be the same as, or more recent than, 1 day ago.
Why is this, or what might we be doing wrong? I would expect the self() subquery to process only a very small number of bytes, since it only looks at one field (which is often the partitioning field), and the amount of data inserted should then be much smaller.
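For reference, here is a minimal sketch of the shape of the model we're running (the table, column, and key names are illustrative, not our real ones):

    config {
      type: "incremental",
      uniqueKey: ["id"],
      bigquery: { partitionBy: "date" }
    }

    SELECT id, date, payload
    FROM ${ref("source_events")}  -- hypothetical source table
    ${when(incremental(),
      `WHERE date > (SELECT MAX(date) FROM ${self()})`)}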
Thanks,
Benji
Hi @benjikj-cera,
Welcome to Google Cloud Community!
You are right that the ${self()} function should help you process fewer bytes. It simply references the previous output of the same model to determine which data to process:
WHERE date > (SELECT MAX(date) FROM ${self()})
The condition you've set up is correct: it ensures that only rows from the source table whose date is greater than the maximum date currently in the target table are included. This is the recommended pattern for handling incremental tables in Dataform.
One reason ${self()} might process more data than you expect is that the subquery is evaluated at run time: BigQuery does not know the value of MAX(date) when it plans the query, so it cannot prune partitions on that filter and must scan the full source table to find the matching rows.
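To illustrate with a hypothetical date-partitioned table: BigQuery can prune partitions when the filter value is known before the query runs, but not when it comes from a subquery, even if both queries return the same rows.

    -- Constant filter: known at planning time, only matching partitions are scanned
    SELECT * FROM my_dataset.events
    WHERE date > DATE '2024-01-01';

    -- Subquery filter: value only known at run time, so all partitions are scanned
    SELECT * FROM my_dataset.events
    WHERE date > (SELECT MAX(date) FROM my_dataset.events_incremental);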
Additionally, if you have a large dataset that undergoes frequent updates, even slight changes in the WHERE clause can influence the amount of data processed.
Consider the following solutions:
- Declare the checkpoint value in a pre_operations block and reference the declared variable in your WHERE clause; because the variable is resolved before the main query runs, BigQuery can prune partitions on it (see the sketch below).
- Make your unique_key as specific as possible, so the merge has to match as few rows as possible.
- Set updatePartitionFilter in the bigquery config block, so the merge into the existing table only considers recent partitions.
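For the first option, here is a minimal sketch of the pre_operations pattern (it mirrors the incremental-table example in the Dataform documentation; table and column names are illustrative):

    config {
      type: "incremental",
      uniqueKey: ["id"],
      bigquery: { partitionBy: "date" }
    }

    pre_operations {
      DECLARE date_checkpoint DEFAULT (
        ${when(incremental(),
          `SELECT MAX(date) FROM ${self()}`,
          `SELECT DATE("2000-01-01")`)}
      );
    }

    SELECT *
    FROM ${ref("source_events")}  -- hypothetical source table
    WHERE date > date_checkpoint

Because date_checkpoint is a scripting variable whose value is resolved before the main statement runs, BigQuery can use it to prune partitions, which is what actually reduces the bytes processed.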
Note: when you first create and run an incremental table, the initial build has to process the full source data, so it can involve a significant amount of data, sometimes similar to a non-incremental build.
You can also refer to this Medium article, which provides a detailed explanation of creating an incremental table: Modern data pipeline building with BigQuery Dataform — Incremental Tables. Additionally, this related Google Cloud Community question regarding incremental tables may help.
I hope the above information is helpful.
Hi @caryna,
Thanks for your reply.
The first option you mention has worked very well for us! It reduced our bytes scanned for the incremental run from 38GB to 262MB for one example model.
We were as specific as possible in our unique key, based on the ID of the table.
The updatePartitionFilter did not seem to change the estimated bytes processed; however, I wonder if the BigQuery config settings do not get accounted for in the Dataform estimated bytes. The first suggestion has helped us enough, though, so I'm happy to leave that one. But a question out of interest anyway: does this filter only apply to the incremental model, and the non-incremental model completely ignores it?
Much appreciated!
Benji
Hi @benjikj-cera,
@benjikj-cera wrote:
I wonder if the BigQuery config settings do not get accounted for in the Dataform estimated bytes.
That's right. Dataform and BigQuery serve different purposes: Dataform is designed for data transformation and for managing data pipelines, and its estimated bytes do not take the BigQuery-specific configuration settings into account.
@benjikj-cera wrote:
does this filter only apply to the incremental model, and the non-incremental model completely ignores it
Yes, the updatePartitionFilter and pre_operations blocks in Dataform are specifically intended to optimize incremental table updates. The Dataform documentation strongly suggests they are primarily for incremental workflows.
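For illustration, a sketch of where updatePartitionFilter sits (the filter expression and names are illustrative):

    config {
      type: "incremental",
      uniqueKey: ["id"],
      bigquery: {
        partitionBy: "date",
        updatePartitionFilter: "date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)"
      }
    }

On an incremental run, Dataform compiles this to a MERGE and adds the updatePartitionFilter condition to the merge criteria, so only recent partitions of the existing table are scanned. A non-incremental build recreates the table outright, so there is no MERGE for the filter to apply to.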
I hope the above information is helpful.
@caryna wrote:
@benjikj-cera wrote: I wonder if the BigQuery config settings do not get accounted for in the Dataform estimated bytes.
That's right. Dataform and BigQuery serve different purposes: Dataform is designed for data transformation and for managing data pipelines, and its estimated bytes do not take the BigQuery-specific configuration settings into account.
Yep, I get that they are separate products, but I thought Dataform might try to estimate this in some way if a partition key or filter changes, which would be very useful. So instead we have to actually rerun the models to find out the correct bytes processed if we change a BigQuery config setting.
@caryna wrote:
@benjikj-cera wrote: does this filter only apply to the incremental model, and the non-incremental model completely ignores it
Yes, the updatePartitionFilter and pre_operations blocks in Dataform are specifically intended to optimize incremental table updates. The Dataform documentation strongly suggests they are primarily for incremental workflows.
I get that they are intended for incremental models, but I'm asking whether the updatePartitionFilter only applies when running the incremental build of the model. It would seem counterproductive if the updatePartitionFilter also filtered out data during the non-incremental build, since in that build you very often don't want any filters at all.
Much appreciated!
Benji