
Dictionary encoding in BigQuery

Hi there,

I noticed that a column which has only 5 distinct string values still scans a lot of data. Does BigQuery have dictionary encoding, so we can optimize our queries for such cases? For example, the column record_source, which can be one of these values:
- schema_01.table_name
- schema_02.table_name
- schema_03.table_name
- schema_04.table_name
- schema_05.table_name

Maybe there are some flags to include when creating the table.

Thanks in advance.


Howdy.  By dictionary encoding, I take it you want to "encode" the stored value of a column into a more compact/efficient representation than the values currently stored.  For example, if a column's value could only be one of 5 possible values:

  • AAAAA
  • BBBBB
  • CCCCC
  • DDDDD
  • EEEEE

Rather than saving each value in a row, we could create a dictionary that might (for example) say:

  • AAAAA = 1
  • BBBBB = 2
  • CCCCC = 3
  • DDDDD = 4
  • EEEEE = 5

and save the smaller values 1, 2, 3, 4, or 5 instead.

The table would still (logically) hold the same information, and instead of searching for "CCCCC" you could search for "3".
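The idea above can be sketched in a few lines of Python. This is purely illustrative of the dictionary-encoding concept, not how BigQuery stores anything internally:

```python
# Minimal sketch of dictionary encoding (illustrative only).
values = ["AAAAA", "BBBBB", "CCCCC", "DDDDD", "EEEEE"]

# Build the dictionary: each distinct string maps to a small integer code.
dictionary = {v: i + 1 for i, v in enumerate(values)}   # {'AAAAA': 1, ...}
reverse = {code: v for v, code in dictionary.items()}

# "Store" a column of rows as codes instead of full strings.
column = ["CCCCC", "AAAAA", "CCCCC", "EEEEE"]
encoded = [dictionary[v] for v in column]               # [3, 1, 3, 5]

# Decoding restores the original values losslessly.
decoded = [reverse[c] for c in encoded]
assert decoded == column
```

Searching for "CCCCC" then becomes searching for the code 3 in the encoded column.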

With that background, and to the best of my knowledge, there is nothing native in BigQuery that explicitly achieves that effect.  Whatever data you insert into rows of a table is the data that is stored in the table.

That said, let's see if we can't look "behind" the question.  When you store data in a BigQuery table, BigQuery has to persist that data, and it compresses that storage VERY dramatically.  What that means is that even if you store "more data than might strictly be necessary", Google is already compressing the data.  Since data is stored in columnar format, columns containing "similar" data compress extremely well.
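You can get an intuition for why a low-cardinality column compresses so well with a quick experiment. This sketch uses Python's zlib, not BigQuery's actual (proprietary) compression, so treat the ratio as indicative only; the column values mirror the five record_source strings from the question:

```python
import zlib

# A column with only 5 distinct values, repeated across 10,000 rows,
# laid out column-wise (one value per line) as a columnar store would.
distinct = ["schema_%02d.table_name" % i for i in range(1, 6)]
column = "\n".join(distinct[i % 5] for i in range(10_000)).encode()

compressed = zlib.compress(column)
ratio = len(column) / len(compressed)
print(f"raw={len(column)} bytes, compressed={len(compressed)} bytes, "
      f"ratio≈{ratio:.0f}x")
```

Because the same few strings repeat over and over, the compressor effectively builds its own dictionary, and the stored size shrinks by orders of magnitude compared to the raw column.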