While performing a GCS (Parquet) load to BigQuery using the Python SDK, I get a debug message that appears to be backing off retries for a specific API call. Everything works, but the debug message shows the program slowing down.
`DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/my-awesome-project/jobs/5b5ad66b-4050-3m3n-a7c3-f8e14a31c3a4?location=US&prettyPrint=false HTTP/1.1" 200 None
DEBUG:google.api_core.retry:Retrying due to , sleeping 5.1s ...`
Is this simply retrying in order to figure out whether the job has completed? Effectively, if I call `load_table_from_uri`, will that call wait for the job to complete?
If that is the case, can I simply _not_ wait for it and move on to the next load job?
Yes, your understanding is mostly correct: the retries you see are the client polling the job until it completes. Here is some further clarification on the behavior and how to avoid the waiting:
Will load_table_from_uri Wait for Job Completion?
Not by itself. `client.load_table_from_uri` submits the load job and immediately returns a `LoadJob` object; the job then runs server-side. Your program only blocks when you call `load_job.result()`, which polls the job until it finishes. That polling is exactly what your debug output shows: `google.api_core.retry` repeatedly issues `GET .../jobs/{job_id}` requests, sleeping with an increasing back-off between checks until the job is done. The messages are normal and do not indicate an error.
If you want asynchronous behavior (not waiting for completion), simply do not call `result()` right away. Keep the returned `LoadJob` handles, continue with other tasks, and check on the jobs later.
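As a minimal sketch of the two sides of this behavior (the table and bucket names here are placeholders, not taken from your project):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table and source URI, for illustration only.
table_id = "my-awesome-project.my_dataset.my_table"
uri = "gs://my-bucket/data.parquet"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Submitting the job returns a LoadJob immediately; nothing blocks here.
job = client.load_table_from_uri(uri, table_id, job_config=job_config)

# result() is the blocking step: it polls the job with back-off until it
# finishes -- this polling is what emits the google.api_core.retry messages.
job.result()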
Can You Move On to the Next Load Job?
Yes, you can start multiple load jobs without waiting for each individual job to complete, which is useful when the jobs are independent of each other. However, you should keep BigQuery's quotas and limits in mind so you don't hit rate limits or overload the system; one simple way to do that is sketched below.
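For example, you can submit jobs in bounded batches and wait for each batch before starting the next. This is only a sketch: the URIs, table ID, and batch size are placeholders, not documented limits.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholders -- adjust to your project.
table_id = "my-awesome-project.my_dataset.my_table"
file_uri_list = ["gs://my-bucket/part-000.parquet", "gs://my-bucket/part-001.parquet"]
load_job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

BATCH_SIZE = 10  # arbitrary placeholder; tune against your own quota headroom

for start in range(0, len(file_uri_list), BATCH_SIZE):
    batch = [
        client.load_table_from_uri(uri, table_id, job_config=load_job_config)
        for uri in file_uri_list[start : start + BATCH_SIZE]
    ]
    # Block on this batch before submitting the next one.
    for job in batch:
        job.result()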
Recommendations:
- Submit each load with `load_table_from_uri` (or, if you need to build the job resource yourself, with the `Client.create_job` method), and collect the returned job objects instead of waiting on each one.
- Check on individual jobs later with `job.done()` or `client.get_job(job_id)`, and call `job.result()` only when you actually need the load to be finished.
Example Snippet (Asynchronous Loading Without Waiting):
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination and source URIs -- adjust to your project.
table_id = "my-awesome-project.my_dataset.my_table"
file_uri_list = ["gs://my-bucket/part-000.parquet", "gs://my-bucket/part-001.parquet"]

load_job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

load_jobs = []
for file_uri in file_uri_list:
    # Submits the job and returns immediately -- no waiting happens here.
    load_job = client.load_table_from_uri(file_uri, table_id, job_config=load_job_config)
    print(f"Started load job: {load_job.job_id}")
    load_jobs.append(load_job)

# You can proceed with other tasks here.
# Use client.get_job(job_id) or job.done() to check the status of each job later.
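If you only have a job ID later (for example, in a separate process), you can re-fetch the job with `client.get_job`. A small sketch, reusing the job ID and `location=US` from your debug log:

from google.cloud import bigquery

client = bigquery.Client()

# Job ID and location taken from the debug output above.
job = client.get_job("5b5ad66b-4050-3m3n-a7c3-f8e14a31c3a4", location="US")
print(job.state)  # "PENDING", "RUNNING", or "DONE"
if job.state == "DONE" and job.error_result:
    print(job.error_result)  # populated only if the job failed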