Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Split table to smaller tables base on column values

Hi, I have a table with many columns: A, B, C,...

How can I split my table to smaller tables which have same value of A.

I don't want to loop over the table to export smaller tables base on each A values, because it will increase the cost.

 

Thank you.

 

0 2 969
2 REPLIES 2

Here's how you can split a large table in BigQuery into smaller tables based on the values of column A, along with explanations and considerations to optimize this process.

There are a couple of efficient ways to do this in BigQuery without the need for explicit looping:

1. Direct Creation with Queries

This is the most straightforward approach if you know the distinct values of column A in advance:

 
-- Create 'table_A_value1' based on the value 'value1' in column A
CREATE OR REPLACE TABLE `project_id.dataset_id.table_A_value1`
AS 
SELECT *
FROM `project_id.dataset_id.original_table`
WHERE A = 'value1';

-- Create 'table_A_value2'
CREATE OR REPLACE TABLE `project_id.dataset_id.table_A_value2`
AS 
SELECT *
FROM `project_id.dataset_id.original_table`
WHERE A = 'value2';

-- ... and so on 
  • Replace project_id, dataset_id, original_table, and the values ('value1', 'value2', etc.) with your actual project, dataset, table, and values from column A.

2. Dynamic SQL for Unknown A Values

Use this method if you don't know all the distinct values in column A beforehand:

 
DECLARE query_string STRING;
DECLARE values ARRAY<STRING>;

-- Get distinct values from column A
SET values = (
  SELECT ARRAY_AGG(DISTINCT A) 
  FROM `project_id.dataset_id.original_table`
);

-- Build dynamic SQL queries
FOR value IN (SELECT * FROM UNNEST(values)) DO
   SET query_string = CONCAT(
      "CREATE OR REPLACE TABLE `project_id.dataset_id.table_A_", value, "` AS 
       SELECT * FROM `project_id.dataset_id.original_table` WHERE A = '", value, "';"
   );
   EXECUTE IMMEDIATE query_string;
END FOR; 

Important Considerations

  • Cost: BigQuery charges you based on the amount of data processed by the query. The second (dynamic SQL) method might be slightly more expensive than the first as it involves reading the entire table to get unique values from column A.
  • Large Number of Distinct Values: If you have a very large number of distinct values in column A, the dynamic SQL approach would generate a considerable number of queries. It might be worth exploring more controlled splitting mechanisms.
  • Clustering (Optional): If you are frequently querying the smaller tables based on column A, consider clustering them by column A within each table. This can further improve query performance and reduce costs.

Thanks for the reply, it helped me with a dynamic query I was writing at BigQuery, just one quick fix. 
Using value in the dynamic query_string, generated this error:

Unable to coerce type STRUCT<f0_ STRING> to expected type STRING Signature CONCAT<BYTES, [BYTES, ...])

 It looks like iterating the un-nested values doesn't give us a string but a struct, so I fixed it using value[0] instead of just value.