BigQuery is Google Cloud's autonomous data-to-AI platform, streamlining the data journey from ingestion to AI-driven action. This fully managed, serverless enterprise data warehouse separates storage and compute, providing unique resource-allocation flexibility and enabling petabyte-scale analysis at optimized cost. Key features such as compressed storage, compute autoscaling, and flexible pricing contribute to its efficiency. Underpinning BigQuery is Google's powerful infrastructure, including technologies such as Borg, Colossus, Jupiter, and Dremel, as discussed in our "Under the Hood" blog post.
BigQuery is continually pushing the limits of query price/performance, driven by significant advancements in Google's underlying infrastructure. Infrastructure innovations such as L4 in Colossus, userspace host networking, optimized BigQuery storage formats, and a cutting-edge data center network have enabled a complete modernization of BigQuery's core data warehousing technology. Crucially, BigQuery maintains its core principles of self-tuning and zero user intervention, delivering the best possible price/performance for all queries.
To further enhance this, we've been working on a series of "Advanced Runtime" improvements, which will be the focus of our upcoming blog posts, starting with Enhanced Vectorization.
Before diving into Enhanced Vectorization, let's clarify vectorized query execution. In vectorized execution, columnar data is processed in CPU cache-sized blocks using SIMD (Single Instruction, Multiple Data) instructions; this approach is now a de facto industry standard for efficient query processing. BigQuery's Enhanced Vectorization expands on vectorized query execution by applying it to key aspects of query processing, such as filter evaluation in BigQuery storage and parallel execution of query algorithms, and by adding specialized data encodings and optimization techniques. Let's take a closer look.
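To make the idea concrete, here is a minimal, engine-agnostic sketch of block-at-a-time execution (not BigQuery's actual implementation): operators consume fixed-size blocks that fit in CPU cache rather than pulling single rows through the pipeline, which keeps data hot and gives the compiler tight loops it can auto-vectorize with SIMD. The block size here is illustrative.

```python
BLOCK_SIZE = 4096  # illustrative; real engines tune this to CPU cache size

def blocks(column, size=BLOCK_SIZE):
    """Yield cache-sized slices of a column, one block at a time."""
    for i in range(0, len(column), size):
        yield column[i:i + size]

def scan_filter_sum(column, predicate):
    """Scan + filter + aggregate, processing one block per iteration."""
    total = 0
    for block in blocks(column):
        # The tight per-block inner loop is what SIMD-friendly engines vectorize.
        total += sum(v for v in block if predicate(v))
    return total

scan_filter_sum(range(10_000), lambda v: v % 2 == 0)  # -> 24995000
```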
Modern columnar storage formats use space efficient data encodings such as dictionary and run-length encodings. For instance, if a column has a million rows but only 10 unique values, dictionary encoding stores those 10 values once and assigns a smaller integer ID to each row, rather than repeating the full value. Enhanced Vectorization can directly process this encoded data, eliminating redundant computations and significantly boosting query performance. The smaller memory footprint of this encoded data also improves cache locality and creates more opportunities for vectorization.
Figure 1: Dictionary and Run-Length encodings
For example, as Figure 1 demonstrates, the string values “Coupe”, “Truck”, and “Van” are stored once in a dictionary, and each (possibly repeated) string literal in the column is replaced with an integer index into that dictionary. The repeated integer values can then be further compressed with run-length encoding. Both encodings can offer substantial space savings.
Enhanced Vectorization integrates native support for dictionary and run-length encoded data directly into its algorithms. This, combined with optimization techniques such as expression folding, folding propagation, and common subexpression elimination, allows it to intelligently reshape query execution plans. The result can be a significant reduction or complete removal of unnecessary data processing.
Consider a scenario where REGEXP_CONTAINS(id, '[0-9]{2}$') AS shard receives dictionary-encoded input. The REGEXP_CONTAINS calculation is performed only once for each unique dictionary value, and the resulting expression is also dictionary encoded, reducing the number of evaluations significantly and leading to performance improvements.
Figure 2: Dictionary folding
Here, the calculation is applied directly to the dictionary-encoded input, producing dictionary-encoded output and skipping dictionary expansion entirely.
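A hedged sketch of this dictionary-folding idea (the data and helper names are made up for illustration): the expression is evaluated once per unique dictionary value, and the per-row integer IDs are reused unchanged, so the output stays dictionary encoded.

```python
import re

def fold_over_dictionary(fn, dictionary, ids):
    """Evaluate fn once per dictionary entry; reuse the row IDs as-is."""
    folded = [fn(v) for v in dictionary]   # len(dictionary) evaluations
    return folded, ids                     # output is still dictionary encoded

# Hypothetical dictionary-encoded `id` column: 3 unique values, 6 rows.
dictionary = ["order_a7", "order_42", "item_x"]
ids = [0, 1, 1, 2, 0, 1]

ends_in_two_digits = lambda s: re.search(r"[0-9]{2}$", s) is not None
out_dict, out_ids = fold_over_dictionary(ends_in_two_digits, dictionary, ids)
# out_dict == [False, True, False]: three regex evaluations cover six rows
```

With a million rows and ten distinct values, the same pattern turns a million regex evaluations into ten.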
We take expression folding even further by, in some cases, converting an entire expression into a constant. Consider this query:
SELECT SUM(number) FROM table
WHERE REGEXP_CONTAINS(id, '^.*[0-9]{2}');
If the id column in the Capacitor file being processed on the current Dremel leaf node is dictionary encoded, expression folding evaluates the REGEXP_CONTAINS condition against every dictionary value. If none of those values contains two consecutive digits, the condition is provably false for the whole file, and the WHERE clause is replaced with the constant false. As a result, BigQuery skips scanning that Capacitor file entirely, significantly boosting performance. These optimizations apply across a broad range of scenarios, not just the query used in this example.
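The file-skipping consequence of folding to a constant can be sketched as follows. This is an assumed, simplified model (the file names, dictionaries, and helper are hypothetical): if the predicate is false for every value in a file's column dictionary, the whole file can be skipped without reading a single row.

```python
import re

def can_skip_file(predicate, column_dictionary):
    """True when the predicate is provably false for the entire file."""
    return not any(predicate(v) for v in column_dictionary)

# '^.*[0-9]{2}' is equivalent to "contains two consecutive digits".
contains_two_digits = lambda s: re.search(r"[0-9]{2}", s) is not None

# Hypothetical per-file dictionaries for the `id` column:
files = {
    "file_a": ["alpha", "beta-3", "gamma"],  # no value has two consecutive digits
    "file_b": ["row-42", "row-7"],           # "row-42" matches
}
skippable = [f for f, d in files.items() if can_skip_file(contains_two_digits, d)]
# skippable == ["file_a"]: the filter folds to constant false for that file
```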
Our state-of-the-art join algorithm preserves dictionary and run-length encoded data wherever possible and makes runtime decisions that take data encoding into account. For example, if the join key on the probe side is dictionary encoded, we can use that knowledge to avoid repeated hash table lookups. Similarly, during aggregation, we can skip building a hash map when the data is already dictionary encoded and its cardinality is known.
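The probe-side optimization can be sketched like this (an assumed, simplified model, not BigQuery's join implementation): when the probe-side key is dictionary encoded, each unique key hits the hash table once, and the per-row IDs simply reuse those results.

```python
def probe_with_dictionary(hash_table, dictionary, ids):
    """Hash join probe: one lookup per distinct key instead of one per row."""
    per_value = [hash_table.get(k) for k in dictionary]
    return [per_value[i] for i in ids]

build_side = {"Coupe": 2, "Truck": 4}   # hypothetical key -> build-side payload
dictionary = ["Coupe", "Truck", "Van"]
ids = [0, 1, 1, 2, 0, 0, 1]             # 7 probe rows need only 3 lookups

matches = probe_with_dictionary(build_side, dictionary, ids)
# matches == [2, 4, 4, None, 2, 2, 4]   (None = no match for "Van")
```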
Enhanced Vectorization harnesses sophisticated parallelizable algorithms for efficient joins and aggregations. When parallel execution is enabled in a Dremel leaf node for certain query execution modes, the join algorithm can build and probe join hash tables in parallel using multiple threads. Similarly, aggregation algorithms can perform both local and global aggregations across multiple threads simultaneously. This parallel execution of join and aggregation algorithms leads to a substantial acceleration of query execution.
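As a toy model of the local/global aggregation pattern described above (not BigQuery's engine), each worker thread aggregates its own block locally, and the partial results are then merged into a global result:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def local_aggregate(block):
    """Phase 1: per-thread local aggregation over one block."""
    return Counter(block)

def parallel_count(blocks):
    """Aggregate blocks in parallel, then merge partials globally."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_aggregate, blocks))
    total = Counter()
    for p in partials:                     # phase 2: global merge
        total.update(p)
    return dict(total)

blocks = [["a", "b", "a"], ["b", "b"], ["a", "c"]]
result = parallel_count(blocks)
# result == {"a": 3, "b": 3, "c": 1}
```

The local phase is embarrassingly parallel, so adding threads shortens the dominant cost; only the small merge step is sequential.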
Capacitor has been re-engineered for the Enhanced Vectorization runtime, making it smarter and more efficient. The updated version natively supports semi-structured and JSON data, using sophisticated operators to rebuild JSON data efficiently. Capacitor lets the Enhanced Runtime directly access dictionary and run-length encoded data and apply various optimizations based on data characteristics. It intelligently applies folding to a constant when an entire column holds the same value, and it can prune expressions in NULL-handling functions such as IFNULL and COALESCE when a column is confirmed to be NULL-free.
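The NULL-free pruning idea can be sketched as a tiny rewrite rule. This is a hedged illustration; the expression representation and metadata fields are hypothetical, not Capacitor's actual structures:

```python
def prune_ifnull(expr, column_stats):
    """Rewrite ('IFNULL', col, fallback) -> col when col is provably NULL-free."""
    op, col, fallback = expr
    if op == "IFNULL" and not column_stats[col]["has_nulls"]:
        return col          # the fallback branch can never fire; drop it
    return expr             # otherwise leave the expression unchanged

# Hypothetical per-column metadata, as a storage format might track it:
stats = {"id": {"has_nulls": False}, "note": {"has_nulls": True}}

prune_ifnull(("IFNULL", "id", "''"), stats)    # -> "id" (expression pruned)
prune_ifnull(("IFNULL", "note", "''"), stats)  # unchanged: NULLs are possible
```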
Capacitor leverages the same vectorized engine as Enhanced Vectorization to efficiently push down filters and computations. This allows for tailored optimizations based on specific file characteristics and the expressions used. When combined with dictionary and run-length encoded data, this approach delivers exceptionally fast and efficient data scans, enabling further optimizations like expression folding.
Let's illustrate the power of these techniques with a concrete example. Enhanced Vectorization accelerated one query by 21 times, slashing execution time from over a minute (61 seconds) down to 2.9 seconds.
The query that achieved this dramatic speedup was:
SELECT
  ANY_VALUE(id) AS id,
  hash_id
FROM (
  SELECT
    CAST(source_id AS STRING) AS id,
    TO_HEX(SHA1(CAST(source_id AS STRING))) AS hash_id
  FROM `source_data`)
WHERE
  hash_id IS NOT NULL
GROUP BY
  hash_id
This query ran against a table with over 13 billion logical rows spread across 167 partitions, stored in Capacitor columnar storage format and optimized with dictionary and run-length encoded data.
Without Enhanced Vectorization
A regular query engine executing this query would decode the encoded columns, evaluate the CAST and TO_HEX(SHA1(...)) expressions row by row across all 13 billion rows, and then filter and aggregate the materialized results.
With Enhanced Vectorization
When Enhanced Vectorization processed the same query over the same dataset, it automatically applied the optimizations described above, including dictionary-aware expression evaluation, expression folding, and encoding-preserving aggregation.
This 21x query speedup vividly demonstrates how tight integration between the Enhanced Runtime and Capacitor, together with the optimization techniques described above, can deliver substantial query performance improvements.
Advanced Runtime and Enhanced Vectorization are now in preview. Over the coming months, they will be enabled for all customers by default, but you can enable them earlier for your project.
BigQuery's Enhanced Vectorization significantly improves query price/performance. Internally, with the Enhanced Runtime we've seen a substantial reduction in query latency at comparable or even lower slot utilization, though individual query results can differ. This performance gain comes from innovations in both Enhanced Vectorization and BigQuery's storage formats. We're dedicated to continuously improving both, applying even more advanced optimizations alongside Google's infrastructure advancements in storage, compute, and networking to further boost query efficiency and expand the range of queries the Enhanced Runtime can handle. BigQuery will soon offer enhanced vectorization for Parquet files and Iceberg tables.