I'm using BigQuery as my DB with a Scrapy spider. Below are two pipelines that store data in the DB: one uses insert, the other update. The update method is roughly 70 times slower than insert (only about 20 records updated per minute): an update takes 3.56 seconds while an insert takes only 0.05 seconds. Where am I going wrong, and how can I speed up the update method?
P.S. The table currently holds around 20k records and could grow to 500,000. The records need to be updated daily.
Update method
# Define the update query
query = f"""
    UPDATE `{self.dataset_id}.{self.table_id}`
    SET `Sold Status` = '{data['Sold Status']}',
        `Amount of Views` = '{data['Amount of Views']}',
        `Amount of Likes` = '{data['Amount of Likes']}',
        `Sold Date & Time` = '{data['Sold Date & Time']}'
    WHERE `Item number` = '{data['Item number']}'
"""

start_time = time.time()

# Run the update query
job = self.client.query(query)

# Wait for the job to complete
job.result()

# Check if the query was successful
if job.state == 'DONE':
    print('Update query executed successfully.')
else:
    print('Update query failed.')

end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)
return item
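As a side note, building the UPDATE by interpolating scraped values into an f-string is fragile (a stray quote in any value breaks the SQL). A safer variant is to pass the values as query parameters. The sketch below is an assumption-based rewrite of the same statement in the same pipeline context, using the google-cloud-bigquery client's query parameters; it does not make per-row DML faster, it only makes it more robust:

from google.cloud import bigquery

# Sketch: the same single-row UPDATE, but with query parameters instead of
# f-string interpolation (column names are taken from the post above).
query = f"""
    UPDATE `{self.dataset_id}.{self.table_id}`
    SET `Sold Status` = @sold_status,
        `Amount of Views` = @views,
        `Amount of Likes` = @likes,
        `Sold Date & Time` = @sold_date
    WHERE `Item number` = @item_number
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("sold_status", "STRING", data['Sold Status']),
        bigquery.ScalarQueryParameter("views", "STRING", data['Amount of Views']),
        bigquery.ScalarQueryParameter("likes", "STRING", data['Amount of Likes']),
        bigquery.ScalarQueryParameter("sold_date", "STRING", data['Sold Date & Time']),
        bigquery.ScalarQueryParameter("item_number", "STRING", data['Item number']),
    ]
)
self.client.query(query, job_config=job_config).result()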
Insert method
start_time = time.time()

data = item
slug = data['slug']
if slug in self.ids_seen:
    raise DropItem("Duplicate item found: {}".format(slug))
else:
    data.pop('slug', None)
    self.ids_seen.add(slug)

    table_ref = self.client.dataset(self.dataset_id).table(self.table_id)

    # Define the rows to be inserted
    rows = [data]

    # Insert rows into the table
    errors = self.client.insert_rows_json(table_ref, rows)
    if errors == []:
        print("Rows inserted successfully.")
    else:
        print("Encountered errors while inserting rows:", errors)

end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)
return item
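On the speed question itself: every UPDATE statement runs as its own BigQuery query job, and each job carries a few seconds of scheduling overhead, which is why per-item updates crawl along at ~20 rows a minute while streaming inserts return almost instantly. A common pattern to avoid this (a sketch under assumptions, not code taken from this thread) is to buffer the scraped items, load the whole batch into a staging table with a single load job, and then apply one MERGE against the main table. The staging-table name and helper function below are hypothetical:

from google.cloud import bigquery

def flush_updates(client, dataset_id, table_id, rows):
    """Load buffered rows into a staging table, then apply them with one MERGE.

    `rows` is a list of dicts keyed by the column names used in the post;
    the staging-table name is an assumption for illustration.
    """
    staging_id = f"{dataset_id}.{table_id}_staging"

    # One load job for the whole batch; overwrite the staging table each run.
    load_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,
    )
    client.load_table_from_json(rows, staging_id, job_config=load_config).result()

    # One MERGE updates every matching row in a single query job.
    merge_sql = f"""
        MERGE `{dataset_id}.{table_id}` AS target
        USING `{staging_id}` AS source
        ON target.`Item number` = source.`Item number`
        WHEN MATCHED THEN UPDATE SET
            `Sold Status` = source.`Sold Status`,
            `Amount of Views` = source.`Amount of Views`,
            `Amount of Likes` = source.`Amount of Likes`,
            `Sold Date & Time` = source.`Sold Date & Time`
    """
    client.query(merge_sql).result()

In a Scrapy pipeline this would typically be called from close_spider(), or whenever the buffer reaches a few thousand items, so a daily run over 20k–500k records executes a handful of jobs instead of one job per record.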
OK, so thanks to ChatGPT I found a workaround for this issue. What used to take my code 9 hours now takes under 15 minutes, roughly a 36-fold improvement.
Basically, what I do is:
Hey @bbjjj, thanks for sharing this challenge you're having with update speeds in BigQuery. BigQuery natively has a limit of 10,000 rows that can be updated in a single request, so speed will definitely be limited even when using the API. I appreciate that having to split these jobs into multiple requests is time-consuming, which slows down the overall process.
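To illustrate the splitting mentioned above, slicing the buffered rows into batches can be as simple as the purely illustrative helper below (the 10,000 batch size just mirrors the figure in the reply, and process_batch is a hypothetical per-batch call):

def chunked(rows, batch_size=10000):
    """Yield successive batches of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Example usage: send each batch as its own request/job.
# for batch in chunked(all_rows):
#     process_batch(batch)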
I'll pass this feedback along to the team. Use cases are always helpful context for us. Do you mind sharing what industry you're in?
Thanks!