Hi all,
I want to understand how slots get assigned to BigQuery jobs. Here are two different use cases:
Use case 1:
The reservation has a threshold (max) of 1,200 slots and a baseline of 0.
The UPDATE statement processes 9 GB of data in 6 seconds, utilizing 104 slots.
Use case 2:
The reservation has a threshold (max) of 1,200 slots and a baseline of 600.
The UPDATE statement processes 18 GB of data in 134 seconds, utilizing 40 slots.
Question:
When the baseline is set to 600, i.e. 600 slots are readily available, and no other jobs were running during use case 2, why did the job utilize fewer slots?
Can someone please help me understand this?
Google Cloud BigQuery uses a dynamic slot allocation strategy for running queries. This means that the number of slots assigned to a query can change over time based on the workload and the availability of resources. The actual number of slots used by a query can be less than the maximum number of slots available in your reservation.
In your second use case, even though a baseline of 600 slots was available, the query utilized only 40 slots to process 18 GB of data. This is because BigQuery assigns only the number of slots necessary for the query's execution.
One reason the second use case required fewer slots could be that some of the data had already been processed by a previous query within the same reservation. BigQuery checks at the start of a query for any pre-processed data that can be reused; if such data is available, the query can be executed with fewer slots.
If, in this instance, a prior query had already processed 9 GB of the total 18 GB, the second query would only have needed to process the remaining 9 GB, which it could have done with 40 slots.
Here's a summary table of the factors influencing how BigQuery assigns slots to queries:
| Factor | Explanation |
|---|---|
| Available Slot Count | The total number of slots available in the reservation assigned to the query. |
| Query Priority | The priority level of the query, which determines its scheduling order; high-priority queries are scheduled before low-priority ones. |
| Query Size and Complexity | The size and complexity of the query influence how many slots it requires; larger and more complex queries need more slots. |
| Pre-Processed Data | If pre-processed data can be reused, the query requires fewer slots. |
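If you want to verify how many slots a job actually averaged, one way (a sketch, assuming the US multi-region; the job ID is a placeholder) is to derive it from INFORMATION_SCHEMA.JOBS_BY_PROJECT, since average slots = total slot-milliseconds / job duration in milliseconds:

```sql
-- Average slots for a finished job:
-- total_slot_ms divided by the job's wall-clock duration in milliseconds.
-- `region-us` and the job_id below are placeholders; adjust for your project.
SELECT
  job_id,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS duration_ms,
  SAFE_DIVIDE(
    total_slot_ms,
    TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_id = 'bquxjob_example';  -- hypothetical job ID
```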
Hi,
Thank you for the explanation.
Actually, the two use cases were run on different days and with different baselines. Since the runs were on different days, I assume the second one could not have reused any pre-processed data.
What worries me more is that the run with a baseline of 600 took more time and used fewer slots. The time difference looks far too large, since we are trying to build a near-real-time system.
Also, I am assuming that when the baseline is set to 600, those slots should be readily available for the job. Am I correct here?
You are right regarding the baseline set to 600 and the expected availability of slots for the job. Nevertheless, several factors might have influenced the longer runtime and fewer slots used in use case 2.
One potential factor is variability in the state of the data between the two days when the queries were executed. For instance, if the data was updated on the second day, the use case 2 query would have had to read the entire dataset from scratch, which would have taken more time and consumed more total slot-time.
Another possibility is that the second query used different parameters compared to the first one. For instance, if the second query implemented a more restrictive filter, it would have scanned fewer rows, leading to reduced processing time and a lower number of required slots.
Moreover, it's essential to consider the presence of other queries running in the same reservation at the same time as the second query. In such a scenario, the second query would have needed to wait for available slots, consequently increasing its runtime.
To address concerns about the performance of the second query, consider the following troubleshooting steps:
- Compare the state of the underlying data between the two runs (row counts, bytes processed).
- Verify that the query text and parameters were identical in both runs.
- Check the reservation's monitoring for other jobs running concurrently with the second query, as sketched below.
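For the concurrency check, a rough way to see slot consumption across the project over a time window is the jobs timeline view (a sketch; `region-us` and the timestamps are placeholders):

```sql
-- Per-second slot usage and number of concurrent jobs in the project.
-- period_slot_ms is the slot-milliseconds consumed within each 1-second period,
-- so dividing by 1000 approximates the slots in use during that second.
SELECT
  period_start,
  SUM(period_slot_ms) / 1000 AS approx_slots_in_use,
  COUNT(DISTINCT job_id) AS concurrent_jobs
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE period_start BETWEEN TIMESTAMP('2023-07-01 10:00:00')  -- placeholder window
                       AND TIMESTAMP('2023-07-01 11:00:00')
  AND job_type = 'QUERY'
GROUP BY period_start
ORDER BY period_start;
```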
I will investigate the data from both runs; that should give me some insights. The data definitely doubled in the second case, i.e. from 9 GB to 18 GB.
I checked that the parameters and the query were exactly the same.
Regarding the reservation activity, here's the monitoring graph I created. As you can see, the baseline is 600, since availability never dropped below that. Many slots were free at the time, and this was the only job running at that point.
The highlighted time is where the query was running.
Based on the information you have provided, it seems likely that the longer runtime and fewer slots used in use case 2 are due to the data doubling in size.
When the data is doubled, the query needs to scan twice as many rows, which takes correspondingly longer: BigQuery needs to read each row in the dataset before it can process it.
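One way to compare the two runs on an equal footing is total slot-seconds (average slots × runtime): use case 1 consumed roughly 104 × 6 ≈ 624 slot-seconds for 9 GB, while use case 2 consumed roughly 40 × 134 ≈ 5,360 slot-seconds for 18 GB. The second job therefore did considerably more total work per GB, which is consistent with the note below that slot usage does not scale linearly with data size.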
If you are concerned about the performance of the second query, you can try to optimize it, for example by restructuring the query or by partitioning the data. Here are some tips for optimizing queries:
- Partition and cluster the table so the UPDATE touches only the relevant partitions (see the sketch after this list).
- Filter as early as possible to reduce the bytes scanned.
- Avoid SELECT * and unneeded columns.
- Batch DML statements rather than running many small updates.
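If partitioning is an option, here is a minimal sketch (with hypothetical table and column names) of moving the target table to a partitioned, clustered layout so that the UPDATE scans only the matching partitions:

```sql
-- Hypothetical table/column names; adjust to your schema.
-- Recreate the table partitioned by day and clustered on a common filter column.
CREATE TABLE mydataset.events_partitioned
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id AS
SELECT * FROM mydataset.events;

-- An UPDATE that filters on the partition column scans only the matching
-- partitions instead of the whole table.
UPDATE mydataset.events_partitioned
SET status = 'processed'
WHERE DATE(event_ts) = '2024-01-01'
  AND status = 'pending';
```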
Please note: The relationship between data size and slot usage is not always linear. BigQuery's slot allocation algorithm takes into account a variety of factors, and it's designed to optimize for overall system performance, not necessarily to use all available slots at all times. So, it's possible for a query to use fewer slots even when processing a larger amount of data.