I have e-commerce product details stored as CSV files in a GCS bucket. Each file contains the attributes of a single product, and there are approximately 4 billion such files.
I am planning to use a BigQuery external table to query the data.
It worked when I tested with 10 million such files. My question is: will it still work when the number of files is as high as 4 billion? Even if queries take 15-30 minutes, that is fine.
What other options would you recommend for querying the data if we don't use a BigQuery external table?
While BigQuery can handle very large datasets, querying 4 billion individual files is likely to hit performance bottlenecks. The overhead of opening, reading, and closing each object adds up, and simply listing and managing metadata for 4 billion objects in GCS is slow and expensive in itself. Even if BigQuery can technically handle the workload, query performance will degrade significantly due to the sheer number of files.
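For context, an external table over a layout like this is usually defined with a wildcard source URI, roughly like the sketch below. The bucket path, project/dataset/table names, and schema autodetection are placeholders, not your actual setup:

```python
# Hypothetical sketch: a BigQuery external table over CSV files in GCS.
# Names, paths, and CSV options below are assumptions for illustration only.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
# A wildcard URI matches every object under the prefix, so with ~4 billion
# objects every query has to enumerate and open each matching file.
external_config.source_uris = ["gs://my-bucket/products/*.csv"]
external_config.autodetect = True            # or supply an explicit schema
external_config.options.skip_leading_rows = 1  # if each CSV has a header row

table = bigquery.Table("my-project.my_dataset.products_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```

Because an external table is only a pointer to the objects in GCS (nothing is materialized in BigQuery storage), each query has to re-list and read every object the wildcard matches, which is where the per-file overhead described above comes from.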
Alternative Approaches: