Extract BigQuery Table Definitions All at Once

Is there a way to get all of the table information in our org at once?

These are the only table attributes we are interested in. We query the INFORMATION_SCHEMA in each project with a script, but it is timing out after 24 hours because we have thousands of projects and each project has a lot of tables.

  • table_catalog
  • table_schema
  • table_name
  • column_name
  • table_type
Solved
1 ACCEPTED SOLUTION

Unfortunately ... no.  Your script/programmatic approach is the right way to go.  You'll want to:

1. Build a list of all of your projects in your organization
2. For each project, query [PROJECT_ID].region-[REGION].INFORMATION_SCHEMA.TABLES

You will repeat (2) for each of the regions you are using.
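To make the two steps concrete, here is a minimal sketch in Python, not a drop-in solution.  It assumes the google-cloud-resource-manager and google-cloud-bigquery client libraries, credentials that can see every project and read BigQuery metadata in each one, and that you substitute your own region list.  Note that column_name is not in the TABLES view; you would query the regional INFORMATION_SCHEMA.COLUMNS view for that.

```python
# Minimal sketch, not production code. Assumes google-cloud-bigquery and
# google-cloud-resource-manager are installed and the caller's credentials
# can see every project and read BigQuery metadata in each of them.
from google.cloud import bigquery
from google.cloud import resourcemanager_v3

REGIONS = ["us", "eu"]  # replace with the regions your organization actually uses


def list_project_ids():
    """Step 1: yield the project IDs visible to the caller."""
    client = resourcemanager_v3.ProjectsClient()
    for project in client.search_projects():
        yield project.project_id


def table_metadata(project_id, region):
    """Step 2: query the regional INFORMATION_SCHEMA.TABLES view for one project."""
    client = bigquery.Client(project=project_id)
    sql = f"""
        SELECT table_catalog, table_schema, table_name, table_type
        FROM `{project_id}`.`region-{region}`.INFORMATION_SCHEMA.TABLES
    """
    # column_name lives in the regional INFORMATION_SCHEMA.COLUMNS view,
    # not in TABLES, so query COLUMNS as well if you need column-level detail.
    return list(client.query(sql).result())


for project_id in list_project_ids():
    for region in REGIONS:
        rows = table_metadata(project_id, region)
        # ...write rows to your central results table here...
```

Sharding the project list across several workers and writing each project's rows in one batch, rather than sequentially and row by row, is what will keep a run like this from taking days.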

Where I think we'll have to dig deeper is into what is causing the script to time out.  How is the script being run?  What kind of error/timeout message is being generated?  Is it a security error where a token isn't being refreshed?

Make an estimate of how long you anticipate the work will run.  For example ... if you have 2000 projects, each project has 100 datasets, and each dataset has 50 tables, then that would be metadata for 2000 * 100 * 50 = 10 million tables.

Are you absolutely certain you need all this data at once?

Another thought ... I wonder if there is any mileage in using Dataplex Data Catalog?  Might we be able to switch that on and then ask Data Catalog to return its vision of the landscape?   Be careful ... it is likely Dataplex Data Catalog will run up costs to examine and build the metadata for the tables ... and if there are really 10 Million tables, that could be a non-trivial amount.  Please investigate such charges first.
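If you do experiment with Data Catalog, its search API can return every BigQuery table it has indexed at the organization scope.  A hedged sketch only: it assumes the Data Catalog API is enabled, the caller can search at org scope, and the numeric organization ID below is a placeholder.

```python
# Sketch only. Assumes the Data Catalog API is enabled and the caller can
# search at the organization scope; the org ID below is a placeholder.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
scope = datacatalog_v1.SearchCatalogRequest.Scope(include_org_ids=["123456789012"])

# Ask Data Catalog for every BigQuery table it has indexed in the organization.
for result in client.search_catalog(
    request={"scope": scope, "query": "system=bigquery type=table"}
):
    print(result.relative_resource_name, result.linked_resource)
```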


3 REPLIES

Since there are so many projects, and so many tables in each project, the script runs for more than 24 hours.  The job generates an INSERT statement for each table and then runs that insert to load the data into a results table; each insert takes 1 to 2 seconds, sometimes more.  A stored procedure generates these INSERT statements by joining the COLUMNS and TABLES views.  It is not efficient because it runs a single insert at a time.  They are also keeping a history, so there is a logdate for each run.
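A set-based variant of the same join, writing one project and region per query job instead of one table per insert, would look roughly like the sketch below.  The target table metadata.table_metadata_history and its columns are made-up names, not our actual schema.

```python
# Rough sketch: one set-based INSERT per project/region instead of one INSERT
# per table. The target table `metadata.table_metadata_history` and its columns
# are hypothetical -- rename them to match the real history table.
from google.cloud import bigquery


def load_project_metadata(client: bigquery.Client, project_id: str, region: str) -> None:
    sql = f"""
        INSERT INTO `metadata.table_metadata_history`
            (table_catalog, table_schema, table_name, column_name, table_type, log_date)
        SELECT
            c.table_catalog, c.table_schema, c.table_name, c.column_name,
            t.table_type, CURRENT_DATE()
        FROM `{project_id}`.`region-{region}`.INFORMATION_SCHEMA.COLUMNS AS c
        JOIN `{project_id}`.`region-{region}`.INFORMATION_SCHEMA.TABLES AS t
          USING (table_catalog, table_schema, table_name)
    """
    client.query(sql).result()  # one job per project/region, not one per table
```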

We will look into Dataplex, but since the discovery would incur costs, the team might not use that option.  We'll see.

For what it's worth, we have created a simple catch-all remote Python function that executes the SQL passed to it.  It is very handy for running many queries in parallel.

Applied to your use case, it should never time out.
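Not our exact code, but the shape is roughly this: a small HTTP Cloud Function that speaks the BigQuery remote function contract (a JSON body with a "calls" array, a JSON response with "replies") and runs whatever SQL it is handed.  The function name here is made up.

```python
# Rough sketch of a catch-all remote function, not the exact code we run.
# Deploy as an HTTP Cloud Function, then register it in BigQuery with
# CREATE FUNCTION ... REMOTE WITH CONNECTION ... OPTIONS (endpoint = ...).
import json

import functions_framework
from google.cloud import bigquery

bq = bigquery.Client()


@functions_framework.http
def run_sql(request):
    calls = request.get_json()["calls"]   # one inner list per row BigQuery sends
    replies = []
    for (sql,) in calls:                  # the function takes a single STRING argument
        job = bq.query(sql)
        job.result()                      # wait for the statement to finish
        replies.append(json.dumps({"job_id": job.job_id, "state": job.state}))
    return json.dumps({"replies": replies})
```

Because BigQuery fans rows out to the endpoint in batches, selecting the function over a table of generated statements runs many of them concurrently, which is what makes it handy for this kind of workload.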