Can BigQuery perform a LEFT JOIN on 24 TB of data?

Hi,

I want to use BigQuery to remove duplicate items between two tables (each table has only one column). My plan is to run a LEFT JOIN query between the two tables. One table has about 20 TB of data and the other has 4 TB.

Can I use BigQuery with the default quota to process this amount of data in a single LEFT JOIN statement?

Here is an illustration of the query I want to execute. I want the query to return only the green "A only" part (A is the 4 TB table).

[Image: Venn diagram of the two tables, with the part of A that does not overlap B highlighted in green]
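Roughly, the query I have in mind looks like the sketch below (the project, dataset, table, and column names are placeholders):

-- Anti-join sketch: return the rows of A that have no match in B,
-- i.e. the green "A only" region of the Venn diagram.
-- (my_project, my_dataset, table_a, table_b, and item are placeholder names.)
SELECT
  a.item
FROM `my_project.my_dataset.table_a` AS a        -- the 4 TB table
LEFT JOIN `my_project.my_dataset.table_b` AS b   -- the 20 TB table
  ON a.item = b.item
WHERE b.item IS NULL;                            -- keep only unmatched rows of A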


Thank you!

Solved
1 ACCEPTED SOLUTION

It feels like it should work. I had a look at the BigQuery Quotas and Limits page and couldn't find any references to limits on large joins. A glib answer from me would be "try it and see", but I'm guessing that would be concerning because of cost. The list price for on-demand BigQuery queries is $6.25 per TB scanned, so if your query were to touch 24 TB of data, that could be $150 per query ... maybe not a problem by itself, but it could get costly if you have to keep running it over and over. If I were sitting in your place, I'd run some experiments first. I'd create a couple of tables of, say, 5 GB and 1 GB and validate that my SQL is correct ... prove to myself that the SQL is sound and that the problem "works in the small". I'd then try it with 10 GB and 2 GB tables, then 20 GB and 4 GB, and maybe a few more. I'd plot some curves and try to get a feel for how long 20 TB and 4 TB might take.
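If it helps, the scaled-down test tables can be built cheaply with BigQuery's TABLESAMPLE clause, which reads only a subset of each table's blocks. A rough sketch, reusing the placeholder names from the question (the sampling percentage is just an example, and block sampling is approximate):

-- Create small samples of each table to validate the SQL "in the small".
-- (Table names are placeholders; adjust PERCENT to hit your target sizes.)
CREATE TABLE `my_project.my_dataset.table_a_sample` AS
SELECT item
FROM `my_project.my_dataset.table_a` TABLESAMPLE SYSTEM (0.1 PERCENT);

CREATE TABLE `my_project.my_dataset.table_b_sample` AS
SELECT item
FROM `my_project.my_dataset.table_b` TABLESAMPLE SYSTEM (0.1 PERCENT);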

With that in mind, I'd then look into creating a capacity-based pricing BigQuery reservation. You can create and destroy these as desired ... they don't represent a commitment to use them for an extended period of time. You could then create a 100-slot reservation; one hundred slots will cost $4/hour. Re-run the above tests to see how long your query takes using capacity-based pricing. If you constrain yourself to 100 slots, it could take much longer.
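For reference, reservations and assignments can be created with SQL DDL as well as through the console. A sketch with placeholder project names, assuming an admin project in the US multi-region ... double-check the reservation DDL docs for the exact syntax in your region before running it:

-- Create a 100-slot reservation and assign a project's queries to it.
-- (my_admin_project and my_project are placeholders; so is the region.)
CREATE RESERVATION `my_admin_project.region-us.test_reservation`
OPTIONS (slot_capacity = 100);

CREATE ASSIGNMENT `my_admin_project.region-us.test_reservation.my_assignment`
OPTIONS (assignee = 'projects/my_project', job_type = 'QUERY');

-- Clean up when the experiment is done.
DROP ASSIGNMENT `my_admin_project.region-us.test_reservation.my_assignment`;
DROP RESERVATION `my_admin_project.region-us.test_reservation`;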

The end goal is an answer to the question "If I used fixed-price, capacity-based pricing, how long would my query take?" Since you'd then be running at a fixed price, you may be able to run your query more cheaply, and one possible benefit is that if it fails (and I'm not yet expecting it to), the failure may be MUCH cheaper than it would be on-demand.


2 REPLIES


Hi, sorry for the late reply.

I successfully ran the LEFT JOIN query on 24 TB of data in a single run, and it took approximately 4 hours to complete. Regarding the data: each table has only a single column, so the entire 24 TB is scanned by the query. The data is stored in the us-east1 location, and the query automatically used more than 2,000 default slots. I hope this information is helpful to someone else.
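For anyone who wants to check the same numbers on their own jobs, slot usage and bytes scanned can be pulled from INFORMATION_SCHEMA. A sketch ... the region qualifier and the one-day window are assumptions:

-- Average slot usage and data scanned for recent queries in us-east1.
SELECT
  job_id,
  total_bytes_processed / POW(1024, 4) AS tib_processed,
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
FROM `region-us-east1`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC;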

Thank you very much for your assistance.