One of our team members is testing Export to GCS for data archival. We observed that EXPORT DATA produces a very large number of files (nearly 2,000 files for a single partition, each smaller than 1 GB) in the storage bucket. However, if the EXPORT DATA statement includes a LIMIT clause, the number of files created drops drastically. Is this working as expected? Is there any documentation about why the LIMIT clause reduces the number of files created?
Hi @mdideles,
To answer your query, yes, this is expected behavior.
Instead of attempting to control the exact number of files during export, consider using a wildcard to export data into multiple files. You can export up to 1 GB of table data to a single file; for larger datasets, using a wildcard will allow you to export data across multiple files, which may vary in size and can even exceed 1 GB.
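For illustration, a minimal sketch of an EXPORT DATA statement using a wildcard URI might look like the following; the project, dataset, table, bucket, and partition column names are placeholders, not taken from your setup:

```sql
-- Minimal sketch of an EXPORT DATA statement with a wildcard URI.
-- All names below (bucket, project, dataset, table, partition column) are hypothetical placeholders.
EXPORT DATA
  OPTIONS (
    uri = 'gs://my-archive-bucket/exports/my_table/part-*.parquet',  -- single '*' wildcard: output is sharded across multiple files
    format = 'PARQUET',
    compression = 'SNAPPY',
    overwrite = true
  )
AS
SELECT *
FROM `my-project.my_dataset.my_table`
WHERE partition_date = '2024-01-01';  -- export a single partition, as in the scenario described above
```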
Here is some documentation that might be helpful to you:
I hope the above information is helpful.
@NorieRam thanks for the reply. We are not attempting to control the exact number of files at all, and we are already using the wildcard you mentioned. Still, this many small and even empty files should not be expected. Why can't the export fill each file to the maximum size instead of creating these useless files?
Hi @mdideles,
While it might seem counterintuitive, BigQuery prioritizes export speed by leveraging parallelism. This means it divides the data into smaller chunks to process concurrently, which inevitably leads to a larger number of files.
The tradeoff is speed vs. file count. While it might require some post-processing to merge files (using tools like gsutil compose), this approach significantly reduces the overall export time, especially for substantial datasets.
@NorieRam I understand that tradeoff, but at the very least the export should merge these files at the end.
Anyway, I read in other cases that adding a LIMIT clause to the export avoids creating this many files. What's the performance implication of that LIMIT? Just asking if you know. Thanks.
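For context, the variant I'm referring to would look something like the sketch below; the names and the limit value are placeholders, and I haven't measured its performance impact myself:

```sql
-- Hypothetical sketch of the LIMIT variant mentioned above.
-- Names and the limit value are placeholders; behavior and performance should be verified on your own data.
EXPORT DATA
  OPTIONS (
    uri = 'gs://my-archive-bucket/exports/my_table_limited/part-*.parquet',
    format = 'PARQUET',
    overwrite = true
  )
AS
SELECT *
FROM `my-project.my_dataset.my_table`
WHERE partition_date = '2024-01-01'
LIMIT 100000000;  -- set higher than the expected row count so no rows are dropped
```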