Hi there,
I have started using BI Engine for one of my projects, and when I ran a query recently I noticed that it did not utilize BI Engine; instead it showed this message: "Cannot broadcast table X: number of files 100 > supported limit of 20."
The query contains a left join to table X.
I am on flat-rate pricing with a reservation of 10 GB. I did not specify any preferred tables.
Could someone please explain the reason in a bit more detail, and what I can do to make my query utilize BI Engine?
BI Engine is a fast, in-memory analysis service that accelerates many BigQuery queries beyond what the standard BigQuery execution engine delivers. However, there are some limitations to BI Engine, one of which is that it cannot broadcast tables with more than 20 files. In your case, table X, which your query joins to, has 100 files, exceeding the 20-file limit. This is why BI Engine is not being used for your query. There are a few things you can do to make your query utilize BI Engine:
Reduce the number of files in table X: You could achieve this by merging some of the files or dividing the table into smaller ones. This is often the most straightforward solution to the issue.
Alter the join type: Switching to an inner join from a left join might help, as inner joins require less data to be scanned. However, keep in mind that this could alter the results of your query, as an inner join might exclude some data that a left join would include. Therefore, this solution should only be implemented if it's acceptable to modify the query's results.
Specify preferred tables: The BI Engine allows you to designate which tables should be cached in memory. This can enhance performance for queries that frequently access those tables.
Increase your reservation: If your query is running into memory limits with the BI Engine, you might need to increase your reservation. However, while this could potentially solve the problem, it would also raise the cost, so this trade-off should be carefully considered.
Remember, you're on a flat-rate pricing plan with a 10GB reservation, so you should be able to utilize the BI Engine for your query. However, depending on the specifics of your data and query, the optimal solution may vary. It's always a good idea to experiment with different approaches and monitor their impact on performance.
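To act on the "specify preferred tables" suggestion above, preferred tables are configured on the BI Engine reservation itself, for example via the Cloud Console or the bq CLI. A rough sketch follows; the exact flag names and size units are from memory of the BI Engine docs and should be verified with `bq update --help` before use:

bq update --project_id=my_project --reservation \
  --bi_reservation_size=10GB \
  --preferred_tables=my_project.my_dataset.table_x

Note that preferred tables restrict which tables BI Engine caches; they do not lift the broadcast file limit discussed in this thread.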
Hi @ms4446, thank you so much for your quick response. Just a follow-up question regarding point 1,
"Reduce the number of files in table X": what exactly are you referring to? Do "files" mean the number of partitions? In that case, table X is not partitioned at all.
Here is the storage info of the table.
Could you please clarify?
Let me clarify. In BigQuery, when we refer to "files", it's typically in the context of how the data is physically laid out in BigQuery's managed storage. Each "file" is essentially a chunk of your table's data. These files are created and managed automatically by BigQuery and are not directly related to partitions.
The number of files can increase due to various factors such as the number of load jobs, query jobs, or copy jobs performed on the table. Each of these operations can create new files. Also, modifications to the table like updates or deletes can result in additional files.
In your case, since the table is not partitioned, the number of files is likely due to these types of operations. Unfortunately, BigQuery does not provide a direct way to control the number of files.
However, you can indirectly reduce the number of files by creating a new table from the existing one. When you create a new table based on a SELECT * query from the existing table, BigQuery will often consolidate the data into fewer files. Here's an example of how you might do this:
CREATE TABLE `project.dataset.new_table` AS
SELECT *
FROM `project.dataset.old_table`
After running this query, new_table should have the same data as old_table, but potentially with fewer files. You can then try your original query with new_table to see if it can utilize BI Engine.
Could this metric (the number of files) be exposed, e.g. via INFORMATION_SCHEMA? We see the same reason for some of our queries not using BI Engine, and it's not obvious what the issue is.
No, the number of underlying files for a table in BigQuery is not directly exposed via INFORMATION_SCHEMA as of today. The internal details, including the number of files, are abstracted from users.
There is no public API or SQL query that can retrieve the number of underlying files for a table in BigQuery. However, you can monitor various metrics related to table storage and performance using Cloud Monitoring. These metrics will provide insights into storage usage and query performance, not the specific number of files.
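While the file count itself isn't exposed, the reason a job was or wasn't accelerated is: the INFORMATION_SCHEMA jobs views include a bi_engine_statistics record with the acceleration mode and the disqualification reasons (including the file-limit message seen in this thread). A sketch; the region qualifier is an assumption, adjust it to your dataset's location:

SELECT
  job_id,
  bi_engine_statistics.bi_engine_mode,
  reason.code,
  reason.message
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(bi_engine_statistics.bi_engine_reasons) AS reason
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)

This at least lets you monitor which queries hit the limit, even though the underlying file count stays hidden.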
We are working with a relatively small table, less than 15 GB. If it is fully partitioned and clustered, BI Engine is disabled due to 1000 files > 20 files; if it is neither, it is still disabled with 500 files > 20 files. We even tested a 50 GB reservation for 5 preferred tables with a total size of 40 GB, and BI Engine is still not used. Any suggestions on how to troubleshoot would help.
Thanks
Sure, here are a few suggestions and additional things to consider when troubleshooting why BI Engine might not be used for your query:
We simplified the query and the last item preventing the use of BI engine is the number of files (500>20)
"Review the number of files" - I thought you said we don't have access to this metric, so I'm not sure how to review it.
The offending table is 13 GB and it was recreated from scratch -> INSERT-SELECT with several configurations (partitioning+clustering, partitioning only, clustering only, no-partitioning and no-clustering)
The result for all of the above is the same, BI engine is not used due to the large file number limitation.
The BI Engine allocation is significantly larger than all the data to be cached (no "insufficient memory" reasons).
Thanks for your help
Sorry about my oversight in my previous response. Thank you for pointing it out.
Additional points:
I hope this clarifies things
I'm a little confused by the suggestions. Does BI Engine have a hard limit on joining to tables that are split into more than 20 files? Do the reserved slot allocation and setting preferred tables matter at all if 90% of our queries need to join a very large table that exceeds the 20-file limitation?
BI Engine performs best with data that is pre-joined or pre-aggregated and involves a small number of joins. This is especially true when one side of the join is large, and the others are much smaller, such as when querying a large fact table joined with a small dimension table. Combining BI Engine with materialized views, which perform joins to produce a single large, flat table, can be beneficial. This ensures that the same joins don't have to be executed for every query.
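The pre-joining idea above can be sketched with a materialized view that flattens a large fact table against a small dimension table, so the join runs once at refresh time instead of per query. Table and column names here are illustrative, and note that BigQuery materialized views only support a restricted set of join shapes (inner joins, with limitations), so check the current docs against your schema:

CREATE MATERIALIZED VIEW `project.dataset.sales_flat` AS
SELECT
  s.order_id,
  s.amount,
  d.region_name
FROM `project.dataset.sales` AS s
JOIN `project.dataset.dim_region` AS d
  ON s.region_id = d.region_id;

Queries can then read sales_flat directly, giving BI Engine a single wide table to cache instead of a join it would have to broadcast.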
Does BI Engine reservation size make a difference when dealing with the 20-file limitation? Say I set the BI Engine reservation size to 100 GB, then what?
Unfortunately, increasing the BI Engine reservation size will not directly overcome the 20-file limitation associated with broadcast joins. Here's an in-depth look at why this is the case:
Nature of the Limitation
The 20-file limit is an operational constraint within BI Engine, specifically designed for the efficiency of broadcast joins. This limitation is about how BI Engine replicates data across worker nodes for join operations, rather than being a matter of available memory.
Purpose of Reservation Size Increase
Increasing BI Engine reservation size offers significant benefits by providing more memory to cache larger datasets and to process more complex queries that involve intensive computations within BI Engine. However, these benefits are primarily related to memory capacity rather than the structural limitations of data distribution for joins.
Potential Indirect Benefits of a Larger Reservation
Non-Broadcast Joins: A larger BI Engine reservation might indirectly benefit certain queries by allowing more data to be cached, potentially enabling BI Engine to utilize non-broadcast join strategies that aren't subject to the 20-file limit. However, this is more of a side effect than a direct solution to the file limit issue.
Supporting Workarounds: Implementing strategies such as pre-joining data or consolidating tables into larger, more manageable datasets can be more effective with a larger BI Engine reservation. These approaches often result in datasets that are better suited for caching and processing within BI Engine's enhanced memory capacity.
Important Considerations
Assessment of Bottlenecks: It's crucial to identify the primary bottlenecks in your queries before deciding to increase your BI Engine reservation size. If the main issue is the 20-file limit for broadcast joins, focusing on optimizing your data structure and query design may offer more immediate benefits. However, if your analysis reveals that memory capacity is a constraint for your BI Engine operations, then a larger reservation could indeed improve overall query performance.
While a larger BI Engine reservation size enhances the engine's ability to cache and process data, it does not directly address the 20-file limitation for broadcast joins. A comprehensive approach that includes assessing your specific needs, optimizing data structures, and potentially increasing memory reservation (if justified by other performance constraints) is recommended for maximizing BI Engine's effectiveness.
We are facing the same issue as others in this forum: our BI Engine reason says "number of files 21 > supported limit of 20". In the query where this broadcast table is used, I selected only one column and also applied filters to see if I could overcome the error. The only caveat with this table is that it has no partitioning, only clustering.
The table is 1.2 GB and contains about 2 million records.
Can someone recommend what can be done to reduce the number of table files? I tried CREATE TABLE ... AS SELECT, and it did not make any difference.
What was really helpful for us was reducing the table size. The less-accessed data, in our case data more than 6 months old, was not actively being used, so we exported it to a different table and deleted it from the main table. After that, this error was gone. Maybe that will help you too.
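The archive-then-delete approach described above might look like the following sketch; the table names and the event_date column are illustrative assumptions, so substitute your own schema and double-check the predicate before deleting anything:

-- 1. Copy rows older than 6 months into an archive table.
CREATE TABLE `project.dataset.main_archive` AS
SELECT *
FROM `project.dataset.main`
WHERE event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH);

-- 2. Remove the same rows from the main table (same predicate as step 1).
DELETE FROM `project.dataset.main`
WHERE event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH);

Keeping the two predicates identical matters: rows must land in the archive before the matching delete runs, or data is lost.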
Hi,
I was able to get around this by using an ORDER BY clause.
CREATE TABLE `project.dataset.new_table` AS
SELECT *
FROM `project.dataset.old_table`
ORDER BY <some column>