Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Working with BigQuery JSON & nested data

robertcarr
Staff

JSON data

JSON (JavaScript Object Notation) is a lightweight, text-based data format designed for human readability and easy machine parsing. It's a popular standard for transmitting data between systems, particularly in web applications.

A big advantage of using JSON data in BigQuery is that BigQuery treats JSON as semi-structured data, meaning you don't need a fixed schema upfront. This provides flexibility when working with evolving or diverse data.

Nested and repeated data

Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It's best to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child relationships.

To create a column with nested data, set the data type of the column to RECORD in the schema. A RECORD can be accessed as a STRUCT type in GoogleSQL. A STRUCT is a container of ordered fields. To create a column with repeated data, set the mode of the column to REPEATED in the schema. A repeated field can be accessed as an ARRAY type in GoogleSQL.

Both of these data types in BigQuery can be queried from Looker but sometimes require some SQL manipulation to extract the required data. This article will take you through all the ways you can access these data fields from within Looker.

Load JSON data:

CREATE OR REPLACE TABLE dacoaching.json_table(id INT64, cart JSON, user STRING, location JSON);

INSERT INTO dacoaching.json_table VALUES
(1, JSON_ARRAY(STRUCT(10 AS item_id, 'apple' AS item)), '{"name": "Alice", "age": 30}', JSON '{"steet": "30th Street", "state": "NY"}'),
(2, JSON_ARRAY(STRUCT(15 AS item_id, 'banana' AS item)), '{"name": "Rob", "age": 60}', JSON '{"steet": "Main Street", "state": "NJ"}');

INSERT INTO dacoaching.json_table VALUES
(3, JSON_ARRAY(STRUCT(10 AS item_id, 'apple' AS item), STRUCT(11 AS item_id, 'coconut' AS item)), '{"name": "Alice", "age": 30}', JSON '{"street": "Rodeo Drive", "state": "CA"}'),
(4, JSON_ARRAY(STRUCT(15 AS item_id, 'banana' AS item), STRUCT(16 AS item_id, 'pineapple' AS item)), '{"name": "Rob", "age": 60}', JSON '{"street": "Unknown", "state": "CT"}');

INSERT INTO dacoaching.json_table VALUES
(5, JSON_ARRAY(STRUCT(10 AS item_id, 'apple' AS item), STRUCT(11 AS item_id, 'coconut' AS item)), '{"name": "Alice", "age": 30}', JSON '''{"street": "Rodeo Drive", "state": "CA", "former_locations": [
           {"street": "Mid St", "state": "FL"},
           {"street": "Unknown", "state": "PA"}
       ]}''');

Output:
image1.png

Load Nested and repeated data:

CREATE TABLE IF NOT EXISTS dacoaching.nested_table (
  id STRING,
  names
      STRUCT<
        first_name STRING,
        last_name STRING>,
  addresses
    ARRAY<
      STRUCT<
        status STRING,
        address STRING,
        city STRING,
        state STRING,
        zip STRING,
        numberOfYears STRING>>
) ;

INSERT INTO dacoaching.nested_table (id,
names,
addresses) values ("1",STRUCT<
        first_name STRING,
        last_name STRING>
      ("John","Doe"),
    ARRAY<
      STRUCT<
        status STRING,
        address STRING,
        city STRING,
        state STRING,
        zip STRING,
        numberOfYears STRING>>
      [("current","123 First Avenue","Seattle","WA","11111","1")])

INSERT INTO dacoaching.nested_table (id,
names,
addresses) values ("2",STRUCT<
        first_name STRING,
        last_name STRING>
      ("Jane","Doe"),
    ARRAY<
      STRUCT<
        status STRING,
        address STRING,
        city STRING,
        state STRING,
        zip STRING,
        numberOfYears STRING>>
      [("current","123 25th Street","New York","NY","10001","4"), ("former","123 First Avenue","Seattle","WA","11111","1")])

Output:
image2.png

image3.png

 

Json stored as JSON

BigQuery has a native column type called JSON. This makes it easy to load JSON data and query it with simple syntax.

select location from dacoaching.json_table

image3.png

How do I see the states for each row of data? As you can see above, each row of data has one value for state. BigQuery syntax for accessing JSON data stored like this is extremely easy.

select location.state from dacoaching.json_table

image1.png

If we again look at the above data we’ll see that location contains a field called former_locations which has multiple nested values. These values are stored as an array within the json. There’s a few different ways that we can query this data but keep in mind, we’d likely want to flatten this data too.

Firstly, we can see the data within this field using the same method we did to extract state.

select location.former_locations from dacoaching.json_table

image2.png

We’re likely going to want to extract out the information within this array, like the states. If we try to do this without specifying which array we are referring to or without flattening the data, this isn’t possible i.e. the below returns no results.

select location.former_locations.state from dacoaching.json_table

image4.png

However, we can extract some information out if we specify the index of the array that we want to get the price from, example below.

select location.former_locations[0].state from dacoaching.json_table

image5.png

To see how to flatten this data, check out the Json Arrays section.

Json Arrays

When the data within our JSON field is in an array format, this makes it a little bit more difficult to extract information from. However, the below code will do this for us.

If we want to see the data within the column, we can run the below query.
SELECT cart FROM dacoaching.json_table

image3.png

 However, running something like the below will return no results, because the data is stored as an array of strings within this field.

SELECT cart.item FROM dacoaching.json_table

We can flatten this data using the following but again, we can’t actually query each individual nested value as they are strings.

SELECT JSON_EXTRACT_ARRAY(cart, '$') FROM dacoaching.json_table

image5.png

Now, if the data only had one value per array, rather than multiple values, we could format our query like the following to extract values.

SELECT id, (
   SELECT JSON_VALUE(value, '$.item')
   FROM UNNEST(JSON_EXTRACT_ARRAY(cart)) value
) AS item_values
FROM dacoaching.json_table
where id < 3

image1.png

However, because we have multiple values per array, we will hit the following issue.

image4.png

 To avoid this, we query BigQuery using UNNEST to flatten the table and hence let us query the nested values.

SELECT id, JSON_VALUE(value, '$.item') AS item_values
FROM dacoaching.json_table
, UNNEST(JSON_EXTRACT_ARRAY(cart)) value

image2.png

 

Json stored as a String

Sometimes customers load their JSON data into a string column, but we can still extract this information out using BigQuery JSON functions.

SELECT user from dacoaching.json_table

image5.png

Above, the data is formatted as string values, so any JSON syntax won’t work with this data. We can use JSON_VALUE (or JSON_EXTRACT) to query the data.

SELECT JSON_VALUE(user, '$.age') AS age
FROM dacoaching.json_table;

image1.png

 If the data has been loaded as arrays in a string format, we can still extract the information out using a BigQuery function.

CREATE OR REPLACE TABLE dacoaching.json_table_2(user STRING);

INSERT INTO dacoaching.json_table_2 VALUES
('''[{"age":10,"name":"Alex"},{"age":11,"name":"Nick"}]''');

image2.pngJSON_EXTRACT_ARRAY lets us access each of the individual records and then we can use the information in the Json Arrays section to further analyse.

SELECT JSON_EXTRACT_ARRAY(user, '$') AS values
FROM dacoaching.json_table_2;

image3.png

JSON_VALUE_ARRAY would work if the data was in the following format i.e. scalar.

image4.png

 

Nested data

Nested data that isn’t repeated can be easily accessed using SQL dot notation.

select names
from dacoaching.nested_table

image1.png

As you can see, each column follows the format required to extract specific data.

select 
 names.first_name, names.last_name
from dacoaching.nested_table
group by 1, 2

image2.png

 

Nested and repeated data

Nested and repeated data can be a little trickier to access. We can no longer use the simple dot notation used for non-repeated nested data as we have multiple values per each row i.e. an array of structs.

select addresses
from dacoaching.nested_table

image1.png

We can see on Row 1, we have both the former and current address of the individual, something that previously may have required a one-to-many join, something that isn’t very performant in SQL databases.

select addresses.status
from dacoaching.nested_table

Using the above dot notation, we will receive an error:

image3.png

 

select addresses[offset(0)].status
from dacoaching.nested_table

We can access individual entries using indexing like the above, which can be useful if we want to extract the first value i.e. the current address.

However, if we want all entries in one query, we need to use UNNEST():

SELECT
id,
  a.address,
  a.state
FROM
  dacoaching.nested_table, UNNEST(addresses) AS a
WHERE
  a.state != 'CA'
    group by 1,2,3;

image2.png

 

Looker considerations

Looker adds a group by clause to every SQL statement it generates. This can mean it can be a little bit more difficult to query JSON data. The reason for this is that JSON data from BigQuery cannot be added to a Group By clause.

SELECT id, cart FROM dacoaching.json_table
group by 1,2
SELECT id, location.state FROM dacoaching.json_table
group by 1,2

image2.png

 Therefore, in order to model this data in a scalable fashion within Looker, we either need to UNNEST the data (See JSON Arrays Section) or use a JSON function that outputs a different formatted value e.g. scalar below.

SELECT JSON_VALUE(location, '$.state') AS json_text_string
from dacoaching.json_table
group by 1

image1.png

For Nested and Repeated data, we will need to bring the UNNEST statement into the LookML model. This will provide us a scalable way to query all the nested data.

 join: addresses {
    view_label: "Addresses"
    sql: LEFT JOIN UNNEST(${nested_table.addresses}) as addresses ;;
    relationship: one_to_many
  }

 

1 0 1,734
Authors