Non-uniform integer partitioning in Google BigQuer...

shekhar0413 · 06-04-2024 01:48 PM

In BigQuery integer range partitioning, is it possible to create non-uniform partition intervals? The use case that I have is that the date column is an integer (20201231 for example). All current queries target this date format and cannot be changed so the partition has to be on this field. Integer partitioning for this works, but only for the yearly case, eg:

PARTITION BY RANGE_BUCKET(INTEGER_DATE, GENERATE_ARRAY(20150000, 20250000, 10000))

for partitioning from 2015 to 2024. However, for other cases there are problems. Daily partitioning creates too many partitions for which there are no corresponding actual dates and easily crosses BQ's limit of 4000 partitions while not being able to span even 1 year's worth of data. For other cases (weekly, monthly etc.), the intervals are not aligned with the yearly calendar so some weeks and months will span two partitions. I tried the below (specifying a hardcoded array instead of GENERATE_ARRAY in the PARTITION BY clause) but BQ gives an error:

PARTITION BY RANGE_BUCKET(INTEGER_DATE, [20150000, 20160000, 20170000])

So, directly specifying the partition array does not work.

Does anyone know any alternatives, work arounds to this?

ms4446

In BigQuery, creating non-uniform partition intervals for integer range partitioning directly is not supported, as you've noticed with the errors you've encountered. BigQuery currently only supports uniform range partitioning for integers. However, you can achieve non-uniform partitioning indirectly by leveraging additional columns or calculated fields. Here are a few workarounds:

Instead of partitioning by your integer date, consider using ingestion-time partitioning. This partitions data based on when it's loaded into BigQuery. While not directly aligned with your integer date, you can combine it with clever query filtering (e.g., using _PARTITIONTIME pseudo column) to achieve similar performance benefits.

Create a user-defined function (UDF) to map your integer date to custom partition values.

Example UDF :

 

CREATE OR REPLACE FUNCTION my_dataset.custom_partition(integer_date INT64)
RETURNS INT64 AS (
  CASE
    WHEN integer_date >= 20230101 AND integer_date < 20240101 THEN 2023
    WHEN integer_date >= 20220101 AND integer_date < 20230101 THEN 2022
    -- Add more cases as needed
    ELSE 0  -- Default partition
  END
);

 

Use this UDF in your table creation:

 
CREATE TABLE your_partitioned_table
PARTITION BY my_dataset.custom_partition(INTEGER_DATE) AS
SELECT * FROM your_table;

Table Clustering: If optimizing query performance is your primary goal, consider clustering your table on the INTEGER_DATE column. This physically stores related rows together, improving query efficiency.

Example:

 
CREATE TABLE your_clustered_table
PARTITION BY RANGE_BUCKET(INTEGER_DATE, GENERATE_ARRAY(20150000, 20250000, 10000))
CLUSTER BY INTEGER_DATE AS
SELECT * FROM your_table;

Important Considerations

Data Distribution: Carefully analyze your data distribution before choosing a partitioning strategy. Uneven data distribution can lead to partition imbalance and negate performance benefits.
Query Patterns: Understand your typical query patterns to select a partitioning strategy that best aligns with your use cases.

Hybrid Partitioning with Custom Filtering:

To achieve a similar performance benefit as partitioning by integer date, use a combination of ingestion-time partitioning and filtering:

 
SELECT *
FROM my_dataset.my_table
WHERE _PARTITIONTIME >= TIMESTAMP("2023-01-01")
  AND _PARTITIONTIME < TIMESTAMP("2024-01-01")
  AND INTEGER_DATE >= 20230101 AND INTEGER_DATE < 20240101;

Ensure you choose the right partitioning strategy based on your specific use case, data distribution, and query patterns. For non-uniform partitioning, consider using a custom partitioning function or clustering to achieve the desired performance and flexibility.

shekhar0413

Hi @ms4446 ,

Thanks for the response. The UDF example does not work for me. When trying to create the partitioned table as you mentioned, BQ gives the error: "PARTITION BY expression must be DATE, a DATE column ...".

Secondly, if however, this were successful, I assume the queries would also need to change right? The queries would have to use the UDF too, which is a problem for us since we have multiple users (internal and external) running queries using the integer date and we cannot expect them all to change that.

ms4446

You're right about the limitations of using UDFs for partitioning in BigQuery. BigQuery requires the partition column to be a DATE, TIMESTAMP, or DATETIME type for time-based partitioning, or an INTEGER for range partitioning. Since using a UDF to map integers for partitioning isn't feasible, you'll need to consider alternative approaches that maintain the existing query patterns without requiring users to change their queries.

RANGE_BUCKET with GENERATE_ARRAY only supports uniform intervals, making it unsuitable for your non-uniform yearly-aligned partitioning needs.

BigQuery doesn't support directly passing a custom array to RANGE_BUCKET for partitioning.

The 4000 partition limit can be a constraint, especially with granular partitions such as daily intervals.

Workarounds and Alternatives

1. Use Integer Range Partitioning with Broader Ranges:

Create Yearly Partitions: Partition your table by broader integer ranges (e.g., by year) to reduce the number of partitions and avoid crossing the 4000 partition limit.

Example:
```
CREATE TABLE your_partitioned_table
PARTITION BY RANGE_BUCKET(INTEGER_DATE, GENERATE_ARRAY(20150000, 20250000, 10000)) AS
SELECT * FROM your_table;
```

2. Ingestion-Time Partitioning:

Combine with Integer Partitioning: Use ingestion-time partitioning to handle the data load efficiently and filter based on integer dates in queries. This avoids the complexity of changing user queries.

Example:

 
CREATE TABLE your_ingestion_partitioned_table
PARTITION BY _PARTITIONTIME AS
SELECT * FROM your_table;

Query Example:

 

SELECT *
FROM your_ingestion_partitioned_table
WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2023-01-01") AND TIMESTAMP("2023-12-31")
  AND INTEGER_DATE BETWEEN 20230101 AND 20231231;

3. Pre-Partitioning Data:

Create Multiple Tables: Create separate tables for different ranges of your integer dates and use table wildcard functions to query across these tables.

Example:
- Table your_table_2015 for 2015 data.
- Table your_table_2016 for 2016 data.
Query Example:
```
SELECT *
FROM `your_dataset.your_table_*`
WHERE _TABLE_SUFFIX BETWEEN '2015' AND '2016'
  AND INTEGER_DATE BETWEEN 20150101 AND 20161231;
```

4. Clustering for Query Optimization:

Combine Partitioning and Clustering: Use yearly partitioning combined with clustering on INTEGER_DATE to improve performance within each partition.

Example:

 
CREATE TABLE your_clustered_table
PARTITION BY RANGE_BUCKET(INTEGER_DATE, GENERATE_ARRAY(20150000, 20250000, 10000))
CLUSTER BY INTEGER_DATE AS
SELECT * FROM your_table;

Important Considerations

Data Distribution: Analyze your data distribution to choose the best partitioning strategy. Uneven distribution can lead to imbalanced partitions and reduced performance.
Query Patterns: Understand typical query patterns to select a partitioning strategy that aligns with your use cases and minimizes the impact on existing queries.

Non-uniform integer partitioning in Google BigQuery