I have encountered a weird bug relating to required partition filter columns and merge statements.
The summary is basically when there is both a filter on the partition column AND a `false` condition (which may occur as we use dbt and dynamically populate this value with true or false), the query itself runs with no issue but when it gets used as part of the merge statement, BigQuery returns the `Cannot query over table 'test_partitioned_table' without a filter over column(s) 'tstamp' that can be used for partition elimination` error, despite the filter clearly being in the query.
I assume that the optimizer is somehow removing this filter (because it sees the other clause will always be false), but this is leading to the error. For some reason this isn't removed when it isn't part of a merge statement.
A minimum working example to see the issue is below, does anyone have any experience or can confirm this is a bug?
CREATE TABLE
test_partitioned_table ( id string,
tstamp timestamp )
PARTITION BY
TIMESTAMP_TRUNC(tstamp, DAY) OPTIONS ( require_partition_filter = TRUE);
INSERT INTO
dbt_ryan.test_partitioned_table (id,
tstamp)
VALUES
('hello', timestamp '2023-05-01');
INSERT INTO
test_partitioned_table (id,
tstamp)
VALUES
('world', timestamp '2023-05-02');
CREATE TABLE
test_partitioned_table2 ( id string,
start_tstamp timestamp )
PARTITION BY
TIMESTAMP_TRUNC(start_tstamp, DAY) OPTIONS ( require_partition_filter = FALSE);
-- this works with no issue
SELECT
id,
tstamp AS start_tstamp
FROM
test_partitioned_table
WHERE
tstamp BETWEEN CAST('2023-04-01 00:00:00+00:00' AS timestamp )
AND CAST('2023-05-03 00:00:00+00:00' AS timestamp )
AND FALSE
-- This causes and error based on not filtering on the partition key
MERGE INTO
test_partitioned_table2 dest
USING
(
SELECT
id,
tstamp AS start_tstamp
FROM
test_partitioned_table
WHERE
tstamp BETWEEN CAST('2023-04-01 00:00:00+00:00' AS timestamp )
AND CAST('2023-05-03 00:00:00+00:00' AS timestamp )
AND FALSE ) src
ON
(dest.id = src.id)
WHEN MATCHED THEN UPDATE SET dest.id = src.id, dest.start_tstamp = src.start_tstamp
WHEN NOT MATCHED
THEN
INSERT
(id,
start_tstamp)
VALUES
(id, start_tstamp)
The error message you're encountering is due to an issue with BigQuery's optimizer. It's incorrectly removing the filter on the 'tstamp' column, assuming the false condition will always be false. As a result, it scans all partitions in the 'test_partitioned_table' table, even though the 'tstamp' filter should limit it to the '2023-03-01' partition only.
This is a known bug in BigQuery's optimizer. Google has been notified about this issue and they're currently working on a solution. However, until the bug is fixed, you can use the following workarounds:
MERGE INTO test_partitioned_table AS t
USING (
SELECT * FROM test_partitioned_table
WHERE tstamp = '2023-03-01' AND false
) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET data = s.data
WHEN NOT MATCHED THEN INSERT (id, tstamp, data) VALUES (s.id, s.tstamp, s.data)
search_condition = tstamp = '2023-03-01';
Thanks for your reply - is there a publicly available issue in the IssueTracker (https://issuetracker.google.com/issues) that I can follow please?
I also tried both your suggestions, however neither sql codes are valid, and in the second case this is already actually specifically the issue (that a required partition filter is already set, but it for some reason is not being used).
Another workaround could be to create a temporary table with the SELECT statement and then use that temporary table in the MERGE statement. This could potentially bypass the issue with the partition filter being removed.
Here's an example:
CREATE TEMPORARY TABLE temp_table AS
SELECT
id,
tstamp AS start_tstamp
FROM
test_partitioned_table
WHERE
tstamp BETWEEN CAST('2023-04-01 00:00:00+00:00' AS timestamp )
AND CAST('2023-05-03 00:00:00+00:00' AS timestamp )
AND FALSE;
MERGE INTO
test_partitioned_table2 dest
USING
temp_table src
ON
(dest.id = src.id)
WHEN MATCHED THEN UPDATE SET dest.id = src.id, dest.start_tstamp = src.start_tstamp
WHEN NOT MATCHED
THEN
INSERT
(id,
start_tstamp)
VALUES
(id, start_tstamp);
is this related to the optimizer bug you speak of? in our case we are creating the target table with the require_partition_filter, that's what is throwing the error in the first place it seems. when the partition_filter is not required, the query "works", but likely all partitions are scanned'
Hi @dataders ,Thank you for providing the link to the GitHub issue. Yes, this issue raised by @ryan_snowplow is related to the behavior described above.
Here are a few workarounds you can consider:
Use a Temporary Table: As previously suggested, you can materialize the result of your SELECT
statement into a temporary table and then use that temporary table in your MERGE
statement. This way, the optimizer's behavior with the original SELECT
statement becomes irrelevant for the MERGE
operation.
CREATE TEMPORARY TABLE temp_table AS
SELECT
...
Then use temp_table
in your MERGE
statement.
Dynamic Condition: Instead of using a static AND FALSE
condition, use a dynamic condition that the optimizer can't optimize away. For example:
AND (tstamp IS NOT NULL AND FALSE)
Explicit Partition Filter: If you know the specific partitions you're interested in, you can explicitly list them in the WHERE
clause. This might make it harder for the optimizer to mistakenly remove the partition filter.
WHERE tstamp IN (TIMESTAMP "2023-04-01", TIMESTAMP "2023-04-02", ...)
Avoid MERGE with Partitioned Tables: If possible, consider other methods of updating or inserting data into your partitioned table, such as using separate INSERT
and UPDATE
statements instead of a MERGE
. This might not be ideal, but it could be a temporary solution until the issue with MERGE
is resolved.
Hi @dataders ,Is this issue resolved in any new release of dbt version?, or can you please let me know what i need to do, to resolve that
Below scenario not working:
1.Base table has require_partition_filter enabled, and the View query has row_num logic to get unique records on base table.
If you're experiencing issues where a base table in BigQuery has require_partition_filter enabled and a view built on top of this table includes row number logic for deduplication, there are a few considerations and potential solutions:
Partition Filter Requirement in Views:
Row Number Logic and Scans:
Workarounds:
Example Adjustments: If you need to keep the require_partition_filter and use a view, your view definition could look something like this:
CREATE OR REPLACE VIEW your_view AS
SELECT *, ROW_NUMBER() OVER (PARTITION BY some_partition_col ORDER BY some_ordering_col) AS row_num
FROM your_base_table
WHERE some_partition_col = 'specific_partition_value'; -- Enforces partition filter
When querying the view, always include a partition filter:
SELECT * FROM your_view
WHERE some_partition_col BETWEEN 'start_date' AND 'end_date';
These suggestions aim to balance the strict requirements of partition filters in BigQuery with the flexibility and performance considerations of using views for complex queries.
-- Enforces partition filter 'specific_partition_value' filter is not ideal solution for all the cases.
True, using a specific partition value in the view definition may not be ideal solution, especially when you need to handle multiple partitions dynamically. Here are some alternative strategies that might help:
Instead of hardcoding partition filters in the view, you can modify your queries to pass the required partition filters dynamically. This approach gives you the flexibility to specify different partitions based on your use case when querying the view.
-- Assuming you've set up the view without a specific partition filter:
SELECT * FROM your_view WHERE partition_column BETWEEN @start_date AND @end_date;
If the query patterns are predictable and the data isn't updated frequently, consider using materialized views which can be set to refresh on a schedule. Materialized views can include partition filters that handle a broader range of dates or other partitioning criteria, and they store precomputed results that can make querying more efficient.
CREATE MATERIALIZED VIEW your_materialized_view AS
SELECT *, ROW_NUMBER() OVER (PARTITION BY partition_col ORDER BY some_col) AS row_num
FROM base_table
WHERE partition_col BETWEEN 'start_range' AND 'end_range';
This view would need to be refreshed periodically but allows for faster access to precomputed data.
In some cases, you can write more complex SQL logic in the view to handle a variety of partitions without specifying exact partition values. This might involve using functions or more complex conditions to dynamically determine which partitions to access based on the query's context or additional filters provided at runtime.
If the partition filter requirement proves too restrictive and cannot be efficiently managed within BigQuery alone, consider retrieving a broader dataset into your application or middleware layer and applying further filtering or processing there. This is less efficient in terms of data transfer but might be necessary if the partitioning scheme is highly complex.
If these restrictions are proving too limiting, it might be worth revisiting your partitioning strategy. Perhaps a different partitioning key or a change in the partitioning granularity could align better with your query patterns.
Each of these approaches has trade-offs in terms of performance, cost, and complexity. The right choice depends on your specific requirements, such as how often the data is accessed, the typical query patterns, and how dynamic the partition filters need to be. Adjusting these strategies to match your operational needs can help you manage the require_partition_filter
constraint more effectively.
1.Removing the duplicates in base table and filter partitioning and pass that query to View.
2.I tried adding the partitioning along with cluster and Partition filter not working and its scanning entire data.
Here are a few strategies and considerations that might help refine your approach:
Ensure Correct Partition Filters in Views: When you create a view in BigQuery that relies on a partitioned base table, remember that the view itself does not automatically inherit the partitioning behavior of the base table. Make sure that any query against the view explicitly includes a partition filter. This is necessary even if the underlying table has a require_partition_filter
setting.
Verify Partition and Clustering Definitions: If you have added partitioning along with clustering, ensure that the partitioning key is used effectively in your queries. BigQuery optimizes query performance by pruning partitions when the query filters directly reference the partitioned columns. If the partitioning column isn't properly referenced, BigQuery might end up scanning more data than necessary.
Using QUALIFY with Window Functions: If your aim is to eliminate duplicates using window functions (like ROW_NUMBER()
), consider using the QUALIFY
clause to filter the rows within the same query block. This can sometimes help in ensuring that the filters and window functions are applied efficiently.
SELECT * FROM base_table WHERE partition_column BETWEEN 'start_date' AND 'end_date'
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) = 1;
Debugging Data Scans: Use the EXPLAIN
statement to understand how your queries are being executed. This can provide insights into whether partition pruning is happening as expected or if full table scans are being performed.
Reevaluate Partitioning and Clustering Choices: Sometimes, the choice of partitioning and clustering keys might not align well with the typical query patterns. Reevaluating and possibly redesigning the partitioning and clustering keys based on the most common access patterns can lead to better performance.
Materialized Views for Aggregated Queries: If your view primarily serves aggregated data or results from complex transformations, consider using a materialized view which can store pre-computed results and is optimized for read access.
Adjustments in Data Loading or Transformation Logic: Sometimes, adjustments in the way data is loaded or pre-processed can reduce the need for complex deduplication logic in SQL queries. Ensuring data quality at the ingestion stage might reduce complexity and improve performance downstream.
Its 2025 am still having the same issue