
BigQuery external table using JSON files

I have e-commerce product data in JSON format, and it is updated frequently. I am planning to load this data into BigQuery. Given that the JSON documents are large (a few hundred attributes each) and there are many updates, I found it is not cost-effective to store the entire data set in BigQuery and update it frequently using MERGE queries.

I am now planning to store the frequently used attributes in individual columns in a BigQuery table. However, I also need to support the use cases when somebody wants to access any other attribute of the JSON for some ad hoc analysis. To address this, can I use the following strategy:

Store the JSON for each product in its own file under a specific Cloud Storage (GCS) directory. Create an external table over all the files in that directory, so that each file's content becomes a row in the external table describing a single product. When a product is updated, I update the BigQuery table and also replace that product's file with a new one.

Is this a reasonable approach? If yes, how can I create the external table from the directory?

ACCEPTED SOLUTION

Yes, this is a practical and cost-effective strategy: keep the frequently accessed attributes in a native BigQuery table, and point an external table at the JSON files for ad hoc analysis of everything else.

  1. Cost-Efficiency: By keeping the bulk of your JSON data as external files and only loading frequently used attributes into the native BigQuery table, you reduce BigQuery storage costs. Updates also become much cheaper, since replacing a file in GCS avoids running a MERGE over a very wide table.

  2. Flexibility: The external table provides on-demand access to the full JSON data for ad hoc analysis, so you never lose access to the complete attribute set (see the query sketch after this list).

  3. Performance: Extracting the commonly used attributes into a native BigQuery table optimizes query performance for frequent analytical use cases.
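As a rough sketch of the hybrid access pattern, a query can pull the common attributes from the native table and join in a rarely used attribute from the external table. The table names, the product_id join key, and the warranty_months attribute below are illustrative assumptions, not part of your schema:

-- Common attributes come from the native table; the rarely used
-- warranty_months attribute is read from the external table over the JSON files.
-- All names here are assumed for illustration.
SELECT
  p.product_id,
  p.name,
  p.price,
  e.warranty_months
FROM my_dataset.products AS p
JOIN my_dataset.products_external AS e
  ON p.product_id = e.product_id;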

Creating the External Table

Here's how to create your external table in BigQuery, assuming your JSON files are in a Google Cloud Storage (GCS) bucket:

1. Prepare a Table Definition 

If you want to explicitly define the schema:

  • Create a JSON file representing the schema of your JSON data.
  • You can use bq mkdef for schema auto-detection (see example below).

2. Use the CREATE EXTERNAL TABLE Command

CREATE EXTERNAL TABLE dataset_name.my_external_table
OPTIONS (
  format = 'JSON',
  uris = ['gs://my_bucket/json_data_directory/*']
);

Explanation

  • Replace dataset_name with the name of your BigQuery dataset.
  • Replace my_external_table with your desired external table name.
  • Replace gs://my_bucket/json_data_directory/ with the actual GCS path to your JSON files.
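If you would rather pin the schema than rely on auto-detection, you can also list the columns directly in the DDL. A minimal sketch, with placeholder column names and the ignore_unknown_values option so the few declared columns can coexist with the hundreds of other JSON attributes:

CREATE EXTERNAL TABLE dataset_name.my_external_table (
  product_id STRING,
  name STRING,
  price NUMERIC
)
OPTIONS (
  format = 'JSON',
  uris = ['gs://my_bucket/json_data_directory/*'],
  -- Skip JSON attributes that are not declared above instead of failing.
  ignore_unknown_values = true
);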

Example with Auto-Detection

If you don't have a predefined schema file:

# Auto-detect the schema and generate a table definition file.
# Note: the URI you pass here is baked into the definition, so point it
# at the wildcard covering the whole directory, not a single sample file.
bq mkdef --autodetect --source_format=NEWLINE_DELIMITED_JSON "gs://my_bucket/json_data_directory/*" > table_def.json

# Create the external table from the generated table definition.
bq mk --external_table_definition=table_def.json dataset_name.my_external_table
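Once the table exists, ad hoc questions can be answered directly against the files in GCS. For example, assuming a hypothetical color attribute in the product JSON:

-- Ad hoc aggregation read straight from the JSON files in GCS.
-- The color attribute is assumed for illustration.
SELECT color, COUNT(*) AS product_count
FROM dataset_name.my_external_table
GROUP BY color
ORDER BY product_count DESC;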

Things to Note

  • File Format: External JSON tables expect newline-delimited JSON, so each per-product file must contain its entire JSON object on a single line (a file holding one object on one line is valid).
  • Data Types: BigQuery infers data types from your JSON data. If needed, use a table definition file or an explicit column list to force specific types (see the sketch after this list for nested attributes).
  • Updates: When you replace a JSON file in the GCS directory, the change is picked up automatically the next time the external table is queried, because external tables read the files at query time.
  • Wildcard: The * in the uris option makes the external table reference all files within the specified directory.
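Auto-detection maps nested JSON objects to STRUCT columns, so even deeply nested attributes remain reachable with dot notation. A sketch, assuming a hypothetical specs.battery.capacity_mah path in the product JSON:

-- Reading a nested attribute surfaced by schema auto-detection.
-- The specs.battery.capacity_mah path is a made-up example.
SELECT
  product_id,
  specs.battery.capacity_mah AS battery_mah
FROM dataset_name.my_external_table
WHERE specs.battery.capacity_mah IS NOT NULL;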


Some questions to help refine the design:

  • How large are your JSON files (in MB and in rows)?
  • How do they change? (Updates to existing files, new files, appends to existing files.)
  • How do you plan to query the data that arrives in JSON format? (Real-time queries, batch processing, BI.)
  • How much latency can you accept between when new JSON files are available and when the data shows up in BigQuery queries?

Some immediate thoughts that come to mind:

1. Google doesn't charge for batch loading data into BigQuery (load jobs are free; streaming inserts are not).
2. Data in JSON format is usually larger than the same data stored in BigQuery's internal columnar format.
3. External tables are slower to query than native BigQuery tables.