Split table to smaller tables base on column value...

Daniel2112 · 01-30-2024 11:48 PM

Hi, I have a table with many columns: A, B, C,...

How can I split my table to smaller tables which have same value of A.

I don't want to loop over the table to export smaller tables base on each A values, because it will increase the cost.

Thank you.

ms4446

Here's how you can split a large table in BigQuery into smaller tables based on the values of column A, along with explanations and considerations to optimize this process.

There are a couple of efficient ways to do this in BigQuery without the need for explicit looping:

1. Direct Creation with Queries

This is the most straightforward approach if you know the distinct values of column A in advance:

-- Create 'table_A_value1' based on the value 'value1' in column A
CREATE OR REPLACE TABLE `project_id.dataset_id.table_A_value1`
AS 
SELECT *
FROM `project_id.dataset_id.original_table`
WHERE A = 'value1';

-- Create 'table_A_value2'
CREATE OR REPLACE TABLE `project_id.dataset_id.table_A_value2`
AS 
SELECT *
FROM `project_id.dataset_id.original_table`
WHERE A = 'value2';

-- ... and so on

Replace project_id, dataset_id, original_table, and the values ('value1', 'value2', etc.) with your actual project, dataset, table, and values from column A.

2. Dynamic SQL for Unknown A Values

Use this method if you don't know all the distinct values in column A beforehand:

DECLARE query_string STRING;
DECLARE values ARRAY<STRING>;

-- Get distinct values from column A
SET values = (
  SELECT ARRAY_AGG(DISTINCT A) 
  FROM `project_id.dataset_id.original_table`
);

-- Build dynamic SQL queries
FOR value IN (SELECT * FROM UNNEST(values)) DO
   SET query_string = CONCAT(
      "CREATE OR REPLACE TABLE `project_id.dataset_id.table_A_", value, "` AS 
       SELECT * FROM `project_id.dataset_id.original_table` WHERE A = '", value, "';"
   );
   EXECUTE IMMEDIATE query_string;
END FOR;

Important Considerations

Cost: BigQuery charges you based on the amount of data processed by the query. The second (dynamic SQL) method might be slightly more expensive than the first as it involves reading the entire table to get unique values from column A.
Large Number of Distinct Values: If you have a very large number of distinct values in column A, the dynamic SQL approach would generate a considerable number of queries. It might be worth exploring more controlled splitting mechanisms.
Clustering (Optional): If you are frequently querying the smaller tables based on column A, consider clustering them by column A within each table. This can further improve query performance and reduce costs.

aisejas

Thanks for the reply, it helped me with a dynamic query I was writing at BigQuery, just one quick fix.
Using value in the dynamic query_string, generated this error:

Unable to coerce type STRUCT<f0_ STRING> to expected type STRING Signature CONCAT<BYTES, [BYTES, ...])

It looks like iterating the un-nested values doesn't give us a string but a struct, so I fixed it using value[0] instead of just value.

Split table to smaller tables base on column values