
BigQuery load from GCS using Python client API sleeping...

While performing a GCS (Parquet) load to BQ using the Python SDK, I get a debug message that appears to show retries being backed off for a specific API call. Everything works, but the debug message shows the program slowing down.

`DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/my-awesome-project/jobs/5b5ad66b-4050-3m3n-a7c3-f8e14a31c3a4?location=US&prettyPrint=false HTTP/1.1" 200 None
DEBUG:google.api_core.retry:Retrying due to , sleeping 5.1s ...`

I am curious: is this simply retrying in order to figure out whether the job has completed? Effectively, if I call `load_table_from_uri`, will that call wait for the job to complete?

If this is the case, can I simply _not_ wait for it and move on to the next load job?

ACCEPTED SOLUTION

Yes, your understanding is mostly correct. Here is some further clarification on the behavior and how to handle the retries:

  • DEBUG:urllib3.connectionpool…: This message comes from the urllib3 library and records a successful GET request (HTTP 200) to the BigQuery API, checking the status of your load job.
  • DEBUG:google.api_core.retry…: This message comes from Google's API core retry machinery. While your code waits on a job, the client polls its status with exponential backoff; the blank reason after "due to" is the client's internal "job not complete yet" signal, so it sleeps and polls again. It is routine backoff between status checks, not error recovery (a sketch for silencing these loggers follows this list).
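If these messages are just noise for you, you can keep DEBUG logging for your own code while quieting the two loggers involved. This is a minimal sketch using only the standard logging module; the logger names are taken verbatim from the message prefixes above:

import logging

# Keep DEBUG output for your own application code.
logging.basicConfig(level=logging.DEBUG)

# Quiet the two loggers emitting the lines in question; raising them to
# INFO hides the per-request and per-poll DEBUG chatter.
logging.getLogger("urllib3.connectionpool").setLevel(logging.INFO)
logging.getLogger("google.api_core.retry").setLevel(logging.INFO)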

Will load_table_from_uri Wait for Job Completion?

Not exactly. load_table_from_uri starts the load job and returns a LoadJob object right away; the waiting happens when you call result() on that job. result() blocks, polling the job's status until it completes, and that polling loop is what produces the retry/sleep messages above. Waiting this way ensures the data is fully loaded before you use it in subsequent operations.

If you need asynchronous behavior (not waiting for completion), simply skip the result() call: hold on to the returned LoadJob (or just its job_id) and continue with other tasks while the job runs in the background. You can check on it later with job.done() or client.get_job(). A short sketch of both styles follows.
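Here is a minimal sketch of the two styles; the bucket URI and table ID are placeholders (the project name is borrowed from the log line above):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: substitute your own GCS URI and fully qualified table ID.
uri = "gs://my-bucket/data/file.parquet"
table_id = "my-awesome-project.my_dataset.my_table"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Submitting the job returns a LoadJob immediately; nothing blocks here.
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)

# Blocking style: result() polls until the job finishes. This polling
# loop is what produces the "Retrying due to , sleeping ..." messages.
load_job.result()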

Can You Move Onto the Next Load Job?

Yes, you can start multiple load jobs without waiting for each individual job to complete. This can be useful if the jobs are independent of each other. However, you should always be aware of BigQuery's quotas and limits to avoid hitting rate limits or overloading the system.

Recommendations:

  • Asynchronous Loading: If your workflow allows for it, initiate multiple load jobs asynchronously. This is an efficient approach when dealing with large datasets or when you need to load multiple files at once.
  • Explicit Checking: If knowing the exact moment of job completion is critical for your workflow, explicitly check the job status using the job_id on the LoadJob object that load_table_from_uri returns; a status-checking sketch follows this list.
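Here is a sketch of explicit status checking. It assumes you saved the job IDs at submission time; the IDs are placeholders, and the location must match where the jobs ran (e.g. "US", as in the log line above):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder IDs, collected when the jobs were submitted.
job_ids = ["your-job-id-1", "your-job-id-2"]

for job_id in job_ids:
    job = client.get_job(job_id, location="US")
    print(f"{job.job_id}: {job.state}")  # PENDING, RUNNING, or DONE
    if job.state == "DONE" and job.error_result:
        print(f"  failed: {job.error_result['message']}")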

Example Snippet (Asynchronous Loading Without Waiting):

 
from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: substitute your own table ID and GCS URIs.
table_id = "my-awesome-project.my_dataset.my_table"
file_uri_list = [
    "gs://my-bucket/data/part-0.parquet",
    "gs://my-bucket/data/part-1.parquet",
]

# One shared config; the files in the question are Parquet.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

jobs = []
for file_uri in file_uri_list:
    # Submitting returns a LoadJob immediately; nothing in this loop
    # blocks on job completion.
    load_job = client.load_table_from_uri(
        file_uri,
        table_id,
        job_config=job_config,
    )
    jobs.append(load_job)
    print(f"Started load job: {load_job.job_id}")

# You can proceed with other tasks here.
# Use client.get_job(job_id) to check the status of each job later.
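If a later step depends on the loaded data, the jobs list above gives you a natural barrier: fan out all the submissions first, then wait once at the end. result() returns immediately for a job that has already finished:

# Continues the snippet above: wait for every outstanding job.
for job in jobs:
    job.result()  # blocks until done; raises if the load job failed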
