Is there a way to calculate slot wait time on waiting for output from upstream stage

Hi, I was trying to figure out the slot time wasted in waiting for upstream stage to finish. Say I have a BQ job that has two stages, stage2 depends on the out of stage1. Imagine stage1 and stage2 were running in parallel but stage1 got only half of the slots it needed. So at some point, stage2 would be blocked waiting for the output from stage1. Till stage1 got more slots and finished its work, the slot time of stage2 was wasted. Is there a way I could get a rough estimate of how much stage2 slot time was wasted waiting for stage1?

I notice there were waitMsAvg/waitMsMax in the ExplainQueryStage. For waitMsAvg, the doc says "Milliseconds the average shard spent waiting to be scheduled.". Does that mean the wait time here is between the task is created and the task is sent to a slot? In other words, the wait time has absolutely no overlap with slot time, is that correct? 

Solved Solved
0 3 1,083
1 ACCEPTED SOLUTION

 

You must have query requests that is bigger than your available slots. Waiting time is just the time spent for work to be scheduled and it is not an overlap since query jobs have their timeline of execution.

This timeline provides an accounting of units of work completed, pending, and active within query workers. A query might have multiple stages with active workers simultaneously, so the timeline is intended to show overall progress of the query.


You can refer to this documentation [1].

Since your BigQuery job in stage1 has accumulated the slots, stage2 queued up and waited for an available slot. The total waiting time for stage2 only depends on how fast stage1 can free up a slot and BigQuery ensures that it has given fair scheduling.

You can upsize your slots to avoid wasted slot time or waiting time. For on-demand pricing[2], demanding of additional slots charges you for free[3].

[1] https://cloud.google.com/bigquery/docs/best-practices-performance-overview#query_plan_and_timeline
[2] https://cloud.google.com/bigquery/pricing#on_demand_pricing
[3] https://cloud.google.com/bigquery/docs/release-notes#December_10_2019

View solution in original post

3 REPLIES 3

 

You must have query requests that is bigger than your available slots. Waiting time is just the time spent for work to be scheduled and it is not an overlap since query jobs have their timeline of execution.

This timeline provides an accounting of units of work completed, pending, and active within query workers. A query might have multiple stages with active workers simultaneously, so the timeline is intended to show overall progress of the query.


You can refer to this documentation [1].

Since your BigQuery job in stage1 has accumulated the slots, stage2 queued up and waited for an available slot. The total waiting time for stage2 only depends on how fast stage1 can free up a slot and BigQuery ensures that it has given fair scheduling.

You can upsize your slots to avoid wasted slot time or waiting time. For on-demand pricing[2], demanding of additional slots charges you for free[3].

[1] https://cloud.google.com/bigquery/docs/best-practices-performance-overview#query_plan_and_timeline
[2] https://cloud.google.com/bigquery/pricing#on_demand_pricing
[3] https://cloud.google.com/bigquery/docs/release-notes#December_10_2019


@Shengpei wrote:

Hi, I was trying to figure out the slot time wasted in waiting for upstream stage to finish. Say I have a BQ job that has two stages, stage2 depends on the out of stage1. Imagine stage1 and stage2 were running in parallel but stage1 got only half of the slots it needed. So at some point, stage2 would be blocked waiting for the output from stage1. Till triple fusions infinite fusion stage1 got more slots and finished its work, the slot time of stage2 was wasted. Is there a way I could get a rough estimate of how much stage2 slot time was wasted waiting for stage1?

I notice there were waitMsAvg/waitMsMax in the ExplainQueryStage. For waitMsAvg, the doc says "Milliseconds the average shard spent waiting to be scheduled.". Does that mean the wait time here is between the task is created and the task is sent to a slot? In other words, the wait time has absolutely no overlap with slot time, is that correct? 


 So at some point, stage2 would be blocked waiting for the output from stage1. Till stage1 got more slots and finished its work, the slot time of stage2 was wasted. Is there a way I could get a rough estimate of how much stage2 slot time was wasted waiting for stage1?

It feels like we should be able to query the Job Explain Query Stage records ...

https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#ExplainQueryStage

Sadly, that's about all I personally have on the topic.  What I would do is pick a job that is a suspect and start grunging through the data and see what we see.  Hopefully there are guides or article that describe how to interpret the data.