Dataform Incremental Backfill

Hi guys,

I am trying to create an incremental table for a web and app event tracker, but because I have a resource limit on my BigQuery project, I run into the error below:

"Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 132% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%"

Does anyone know a workaround for creating a destination table that has all the data, so that the incremental SQLX can just build on it?

Solved

4 REPLIES

There are a few strategies you can use to reduce memory usage:

  1. Limit the number of rows processed: You can use the WHERE clause to limit the number of rows processed by your query. This can help reduce memory usage, especially if you're dealing with a large dataset.

  2. Avoid using analytic functions: Analytic functions like OVER() can consume a lot of memory because they usually require sorting the whole input. If possible, try to rewrite your query to avoid them (see the sketch after this list).
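
For example, here is a rough sketch of both ideas combined (the column names and the date range are placeholders, not your actual schema): a deduplication that would normally need ROW_NUMBER() OVER (...) across the whole table can often be replaced by GROUP BY over a bounded date slice, which avoids the large global sort behind the OVER() clause.

SELECT
  user_id,
  event_id,
  MIN(timestamp) AS first_event_timestamp  -- aggregate instead of sorting with OVER()
FROM
  ${ref("your_source_table")}
WHERE
  -- process one bounded slice at a time instead of the full history
  DATE(timestamp) BETWEEN DATE('2024-01-01') AND DATE('2024-01-31')
GROUP BY
  user_id,
  event_id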

As for your question: yes, there is a workaround for creating a destination table that has all the data, so that the incremental SQLX can just build on it. Here are the steps:

  1. Create a destination table with all the data.

In your Dataform file, add the following code:

config {
  type: "incremental",
  bigquery: {
    partitionBy: "DATE(timestamp)",
    clusterBy: ["user_id", "event_id"]
  }
}

SELECT
  *
FROM
  ${ref("your_source_table")}
WHERE
  DATE(timestamp) > (SELECT MAX(DATE(timestamp)) FROM ${self()})

In the above code, the WHERE clause ensures that only new data (i.e., rows whose DATE(timestamp) is greater than the maximum date already in the table) is added to the table. This can help reduce memory usage by limiting the amount of data processed in each run.
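
If the incremental runs themselves still blow past the memory limit while you are backfilling history, one option (just a sketch; the 7-day window is an arbitrary choice) is to also cap the upper bound so each run only pulls a bounded slice, and then run the action repeatedly until the table catches up:

WHERE
  DATE(timestamp) > (SELECT MAX(DATE(timestamp)) FROM ${self()})
  -- cap each run to at most one week of new data (adjust the interval to taste)
  AND DATE(timestamp) <= DATE_ADD(
    (SELECT MAX(DATE(timestamp)) FROM ${self()}),
    INTERVAL 7 DAY
  )

Each execution then advances the table by at most a week, which keeps the amount of data sorted and scanned per run small.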

Thank you for this!


@ms4446 wrote:

config {
  type: "incremental",
  bigquery: {
    partitionBy: "DATE(timestamp)",
    clusterBy: ["user_id", "event_id"]
  }
}

SELECT
  *
FROM
  ${ref("your_source_table")}
WHERE
  DATE(timestamp) > (SELECT MAX(DATE(timestamp)) FROM ${self()})


 

So I must create the table manually before setting the Dataform type to incremental? I thought that with the type set to incremental, it would automatically create the table.
I am trying to confirm this because I got an error saying the table is not in the database when I ran the incremental code directly, without creating the table first.

The error you're encountering suggests that the table you're trying to reference doesn't exist in your database.

Here are a few things to check:

  1. Table Reference: Make sure that the table you're referencing in ${ref("your_source_table")} actually exists in your database. Replace "your_source_table" with the correct name of your source table.

  2. Initial Run: For the first run, since the incremental table does not exist yet, the WHERE clause DATE(timestamp) > (SELECT MAX(DATE(timestamp)) FROM ${self()}) will fail because ${self()} does not exist. You can modify the WHERE clause to account for this, like so:

WHERE
DATE(timestamp) > COALESCE((SELECT MAX(DATE(timestamp)) FROM ${self()}), DATE('2000-01-01'))

This will use a default date (in this case, '2000-01-01') for the first run when the table does not exist yet (there is also a full example file sketched after this list).

  3. Permissions: Ensure that your Dataform project has the necessary permissions to create and modify tables in your BigQuery database.
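
For completeness, here is a minimal sketch of how the whole file can look if you instead guard the filter with Dataform's built-in when() and incremental() helpers, so the WHERE clause is only applied on incremental runs and the very first run simply loads everything (same placeholder table and column names as above):

config {
  type: "incremental",
  bigquery: {
    partitionBy: "DATE(timestamp)",
    clusterBy: ["user_id", "event_id"]
  }
}

SELECT
  *
FROM
  ${ref("your_source_table")}
-- only apply the filter on incremental runs; the first run loads everything
${when(incremental(), `WHERE DATE(timestamp) > (SELECT MAX(DATE(timestamp)) FROM ${self()})`)}

On the first run Dataform creates the table from the full SELECT; on subsequent runs the WHERE clause kicks in and only new dates are appended.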