
Dictionary encoding in BigQuery

Hi there,

I noticed that a column which has only 5 distinct string values still scans a lot of data. Does BigQuery have dictionary encoding, so we can optimize our queries for such cases? For example, the column record_source, which can be one of these values:
- schema_01.table_name
- schema_02.table_name
- schema_03.table_name
- schema_04.table_name
- schema_05.table_name

Maybe there are some flags to include when creating the table.

Thanks in advance.


Howdy.  By dictionary encoding, I take it you want to "encode" the stored value of a column into a more compact/efficient representation than the values currently stored.  For example, if a column's value could only be one of 5 possible values:

  • AAAAA
  • BBBBB
  • CCCCC
  • DDDDD
  • EEEEE

Rather than saving each value in a row, we could create a dictionary that might (for example) say:

  • AAAAA = 1
  • BBBBB = 2
  • CCCCC = 3
  • DDDDD = 4
  • EEEEE = 5

and save the smaller values 1, 2, 3, 4, or 5 instead.

The table would still (logically) hold the same information, and instead of searching for "CCCCC" you could search for "3".
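The idea above can be sketched in a few lines of Python. This is purely illustrative of the dictionary-encoding concept, not how BigQuery stores anything internally:

```python
# Minimal sketch of dictionary encoding (illustrative only).
values = ["AAAAA", "BBBBB", "CCCCC", "DDDDD", "EEEEE"]

# Build the dictionary: each distinct string maps to a small integer code.
dictionary = {v: i + 1 for i, v in enumerate(values)}   # {'AAAAA': 1, ...}
reverse = {code: v for v, code in dictionary.items()}

# "Store" a column of rows as codes instead of full strings.
column = ["CCCCC", "AAAAA", "CCCCC", "EEEEE"]
encoded = [dictionary[v] for v in column]               # [3, 1, 3, 5]

# Decoding restores the original values losslessly.
decoded = [reverse[c] for c in encoded]
assert decoded == column
```

Searching for "CCCCC" then becomes searching for the code 3 in the encoded column.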

With that background, and to the best of my knowledge, there is nothing native in BigQuery that explicitly achieves that effect.  Whatever data you insert into rows of a table is the data that is stored in the table.

That said, let's see if we can't look "behind" the question.  When you store data in a BigQuery table, BigQuery has to persist that data, and it compresses that storage VERY dramatically.  What that means is that even if you store "more data than might strictly be necessary", Google is already compressing the data.  Since data is stored in columnar format, columns containing "similar" data compress extremely well.
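You can get an intuition for why a low-cardinality column compresses so well with a quick experiment. This sketch uses Python's zlib, not BigQuery's actual (proprietary) compression, so treat the ratio as indicative only; the column values mirror the five record_source strings from the question:

```python
import zlib

# A column with only 5 distinct values, repeated across 10,000 rows,
# laid out column-wise (one value per line) as a columnar store would.
distinct = ["schema_%02d.table_name" % i for i in range(1, 6)]
column = "\n".join(distinct[i % 5] for i in range(10_000)).encode()

compressed = zlib.compress(column)
ratio = len(column) / len(compressed)
print(f"raw={len(column)} bytes, compressed={len(compressed)} bytes, "
      f"ratio≈{ratio:.0f}x")
```

Because the same few strings repeat over and over, the compressor effectively builds its own dictionary, and the stored size shrinks by orders of magnitude compared to the raw column.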