
BigQuery Concurrency

Hello,

We are running a multi-threaded Python application that makes BigQuery connections to read some data.

With a maximum of 10 threads we are able to run the code successfully, but when we increase the thread count to 30, it errors out with:

"Connection pool is full, discarding connection: bigquery.googleapis.com. Connection pool size: 10"

Does that mean the maximum number of connections an application can make is 10, and is there any way to increase it?

1 ACCEPTED SOLUTION

Hi @gsanjeevi

Welcome and thank you for reaching out to the community for help.

It appears that the default pool size is indeed 10, as indicated in this published Python document.

I found this GitHub post that I believe offers a good workaround for your use case. Using it as a reference, please try the code snippet below and see whether it works on your end.

from google.cloud import bigquery
import requests.adapters

client = bigquery.Client()
# Replace the default HTTP adapter (pool size 10) with a larger connection pool.
adapter = requests.adapters.HTTPAdapter(pool_connections=128,
                                        pool_maxsize=128, max_retries=3)
client._http.mount("https://", adapter)
client._http._auth_request.session.mount("https://", adapter)
query_job = client.query(QUERY)  # QUERY is your SQL string
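
For reference, here is a rough sketch of how the patched client might then be shared across 30 worker threads. The fetch helper and the placeholder query list are mine, not part of the original workaround:

from concurrent.futures import ThreadPoolExecutor

def fetch(sql):
    # All threads share the single patched client; with pool_maxsize=128 the
    # "Connection pool is full" warning should no longer appear.
    return list(client.query(sql).result())

queries = ["SELECT 1 AS x"] * 30  # placeholder queries; substitute your own
with ThreadPoolExecutor(max_workers=30) as pool:
    results = list(pool.map(fetch, queries))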

Hope this helps. 


6 REPLIES


I tried this some time last week and it worked for me. Thanks for posting.

This has resolved the threading issue, but now BQ is throwing the error "too many DML statements outstanding against table, limit is 20".

Basically, I can't have more than 20 queries pending in a queue for a single table.

Is there any way to increase this limit?

Reading the quotas found here, there does indeed appear to be a documented limit on the number of queued DML statements. You said "Basically I can't have more than 20 queries pending in a queue for a single table", to which my answer is: yes, you most surely can have MANY more than 20 queries queued for a single table; however, a DML statement is not a query. See here. Google defines DML as INSERT, UPDATE, DELETE or MERGE statements, which mutate the data. Remember, BQ is an OLAP database, not an OLTP database; it is optimized for massive queries, not for transactional updates. What is the nature of the activity you are performing where you might have 20 or more DML statements outstanding at one time? When BQ complains that we hit a quota limit, that is often an indication that we are trying to achieve something that is hard to achieve, or that there is a better (more BQ-centric) solution.
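
To make the query-versus-DML distinction concrete, here is a rough sketch (the dataset, table, and column names are placeholders I made up). The first batch consists of queries, so the DML quota does not apply to it; the second batch consists of mutating DML statements, which start failing once more than 20 of them are outstanding against the table, as in the error you quoted:

from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client = bigquery.Client()

def run(sql):
    return client.query(sql).result()

with ThreadPoolExecutor(max_workers=30) as pool:
    # 30 concurrent SELECTs: these are queries, so the DML quota does not apply.
    selects = [pool.submit(run, "SELECT COUNT(*) FROM my_dataset.my_table")
               for _ in range(30)]
    # 30 concurrent DELETEs against the same table: these are mutating DML
    # statements, which fail once more than 20 are outstanding for the table.
    deletes = [pool.submit(run, f"DELETE FROM my_dataset.my_table WHERE id = {i}")
               for i in range(30)]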

I want to delete and insert certain rows in a table. Since the number of rows to be deleted and inserted is large, I've added threading to reduce the processing time, but while doing that I'm getting the error mentioned earlier, so I can't increase my threading above 20.

Is there any alternative?

One possibility may be to treat the changes to your BigQuery table as a stream of CDC changes. See Stream table updates with change data capture. You haven't yet said how you determine which rows are to be deleted from the existing table and which rows are to be inserted. I'm going to imagine some external source system (IoT? External data files?) that you read and then, based on that, decide whether to perform a DELETE or an INSERT. Instead of running DELETE or INSERT SQL statements, you could use the Storage Write API and always do a streaming write with the _CHANGE_TYPE column set to either UPSERT or DELETE. BigQuery then manages the merge of those changes into the table in the background, which should maximize your performance. I'm still curious how often you perform these deletes/inserts into the table. Is this once a day, once an hour, or a continuous process?
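
For reference, below is a very rough sketch of that pattern with the Python Storage Write client, loosely modeled on the published append_rows samples. The project, dataset, and table names are placeholders; row_pb2.Row is a hypothetical protoc-compiled message matching your table schema plus a change-type field mapped to the _CHANGE_TYPE pseudocolumn (the CDC documentation describes how to declare that mapping in your .proto); and the destination table is assumed to have a primary key and a max_staleness option so BigQuery can apply the changes.

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types, writer
from google.protobuf import descriptor_pb2

import row_pb2  # hypothetical protoc-generated module for your row message

write_client = bigquery_storage_v1.BigQueryWriteClient()
table = write_client.table_path("my-project", "my_dataset", "my_table")

# One template carries the stream name (the default stream) and the writer
# schema; follow-up requests only carry serialized rows.
request_template = types.AppendRowsRequest()
request_template.write_stream = f"{table}/_default"

proto_schema = types.ProtoSchema()
proto_descriptor = descriptor_pb2.DescriptorProto()
row_pb2.Row.DESCRIPTOR.CopyToProto(proto_descriptor)
proto_schema.proto_descriptor = proto_descriptor
schema_data = types.AppendRowsRequest.ProtoData()
schema_data.writer_schema = proto_schema
request_template.proto_rows = schema_data

append_rows_stream = writer.AppendRowsStream(write_client, request_template)

# Each change is streamed as an UPSERT or DELETE row instead of a DML statement.
# `change_type` is the assumed proto field that maps to _CHANGE_TYPE.
proto_rows = types.ProtoRows()
upsert = row_pb2.Row(id=42, payload="new value", change_type="UPSERT")
delete = row_pb2.Row(id=7, change_type="DELETE")
proto_rows.serialized_rows.append(upsert.SerializeToString())
proto_rows.serialized_rows.append(delete.SerializeToString())

request = types.AppendRowsRequest()
row_data = types.AppendRowsRequest.ProtoData()
row_data.rows = proto_rows
request.proto_rows = row_data

append_rows_stream.send(request).result()  # wait until the append is acknowledged
append_rows_stream.close()

Because each change is just an appended row on the default stream, there is no per-table queue of DML statements to hit.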