I'm using BigQuery as my DB with a Scrapy spider. Below are two pipelines that store data in the DB: one uses insert, the other update. The update method is roughly 70 times slower than insert (only about 20 records updated per minute): an update takes 3.56 seconds while an insert takes only 0.05 seconds. Where am I going wrong, and how can I speed up the update method?
P.S. The table currently holds around 20k records and could grow to 500,000. The records need to be updated daily.
Update method
# Define the update query
query = f"""
    UPDATE `{self.dataset_id}.{self.table_id}`
    SET `Sold Status` = '{data['Sold Status']}',
        `Amount of Views` = '{data['Amount of Views']}',
        `Amount of Likes` = '{data['Amount of Likes']}',
        `Sold Date & Time` = '{data['Sold Date & Time']}'
    WHERE `Item number` = '{data['Item number']}'
"""

start_time = time.time()

# Run the update query
job = self.client.query(query)

# Wait for the job to complete
job.result()

# Check if the query was successful
if job.state == 'DONE':
    print('Update query executed successfully.')
else:
    print('Update query failed.')

end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)
return item
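As a side note, building the UPDATE by interpolating scraped values into an f-string is fragile (a stray quote in any value breaks the SQL). A safer variant is to pass the values as query parameters. The sketch below is an assumption-based rewrite of the same statement in the same pipeline context, using the google-cloud-bigquery client's query parameters; it does not make per-row DML faster, it only makes it more robust:

from google.cloud import bigquery

# Sketch: the same single-row UPDATE, but with query parameters instead of
# f-string interpolation (column names are taken from the post above).
query = f"""
    UPDATE `{self.dataset_id}.{self.table_id}`
    SET `Sold Status` = @sold_status,
        `Amount of Views` = @views,
        `Amount of Likes` = @likes,
        `Sold Date & Time` = @sold_date
    WHERE `Item number` = @item_number
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("sold_status", "STRING", data['Sold Status']),
        bigquery.ScalarQueryParameter("views", "STRING", data['Amount of Views']),
        bigquery.ScalarQueryParameter("likes", "STRING", data['Amount of Likes']),
        bigquery.ScalarQueryParameter("sold_date", "STRING", data['Sold Date & Time']),
        bigquery.ScalarQueryParameter("item_number", "STRING", data['Item number']),
    ]
)
self.client.query(query, job_config=job_config).result()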
Insert method
start_time = time.time()

data = item
slug = data['slug']
if slug in self.ids_seen:
    raise DropItem("Duplicate item found: {}".format(slug))
else:
    data.pop('slug', None)
    self.ids_seen.add(slug)

    table_ref = self.client.dataset(self.dataset_id).table(self.table_id)

    # Define the rows to be inserted
    rows = [data]

    # Insert rows into the table
    errors = self.client.insert_rows_json(table_ref, rows)
    if errors == []:
        print("Rows inserted successfully.")
    else:
        print("Encountered errors while inserting rows:", errors)

end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)
return item
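On the speed question itself: every UPDATE statement runs as its own BigQuery query job, and each job carries a few seconds of scheduling overhead, which is why per-item updates crawl along at ~20 rows a minute while streaming inserts return almost instantly. A common pattern to avoid this (a sketch under assumptions, not code taken from this thread) is to buffer the scraped items, load the whole batch into a staging table with a single load job, and then apply one MERGE against the main table. The staging-table name and helper function below are hypothetical:

from google.cloud import bigquery

def flush_updates(client, dataset_id, table_id, rows):
    """Load buffered rows into a staging table, then apply them with one MERGE.

    `rows` is a list of dicts keyed by the column names used in the post;
    the staging-table name is an assumption for illustration.
    """
    staging_id = f"{dataset_id}.{table_id}_staging"

    # One load job for the whole batch; overwrite the staging table each run.
    load_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,
    )
    client.load_table_from_json(rows, staging_id, job_config=load_config).result()

    # One MERGE updates every matching row in a single query job.
    merge_sql = f"""
        MERGE `{dataset_id}.{table_id}` AS target
        USING `{staging_id}` AS source
        ON target.`Item number` = source.`Item number`
        WHEN MATCHED THEN UPDATE SET
            `Sold Status` = source.`Sold Status`,
            `Amount of Views` = source.`Amount of Views`,
            `Amount of Likes` = source.`Amount of Likes`,
            `Sold Date & Time` = source.`Sold Date & Time`
    """
    client.query(merge_sql).result()

In a Scrapy pipeline this would typically be called from close_spider(), or whenever the buffer reaches a few thousand items, so a daily run over 20k–500k records executes a handful of jobs instead of one job per record.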
OK, so thanks to ChatGPT I found a workaround for this issue. What used to take my code 9 hours now takes under 15 minutes, roughly a 36-fold improvement.
Basically, what I do is:
Hey @bbjjj, thanks for sharing this challenge you're having with update speeds in BigQuery. BigQuery natively has a limit of 10,000 rows that can be updated in a single request, so speed will definitely be limited even when using the API. I appreciate that having to split these jobs into multiple requests is time-consuming, which slows down the overall process.
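To illustrate the splitting mentioned above, slicing the buffered rows into batches can be as simple as the purely illustrative helper below (the 10,000 batch size just mirrors the figure in the reply, and process_batch is a hypothetical per-batch call):

def chunked(rows, batch_size=10000):
    """Yield successive batches of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Example usage: send each batch as its own request/job.
# for batch in chunked(all_rows):
#     process_batch(batch)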
I'll pass this feedback along to the team. Use cases are always helpful context for us. Do you mind sharing what industry you're in?
Thanks!