
Move approx 100 million records from BigQuery to Cloud SQL

Hello Team,

I have an Apache Beam Dataflow pipeline implemented to transfer approximately 100 million records from BigQuery to Cloud SQL. But because the pipeline runs one insert query per record, the CPU of the Cloud SQL instance gets fully utilized. I want to fix this. We cannot permanently increase the Cloud SQL CPU size, as the actual app usage is low (5-10%); the problem only occurs during the daily batch run.

One of the approaches suggested by our DBA is to export the BQ data to GCS and then import it using the Cloud SQL API. But this creates another problem: we can import only a single file at a time.

I wanted to understand what the best possible solution for this would be.

1. BQ to GCS to Cloud SQL as stated above.

2. Increase Cloud SQL CPU just during the batch run.

3. Any way to optimize/reduce the number of SQL inserts being triggered in Dataflow? I tried setting a batch size, but that only controls the commit size; the number of insert statements stays the same.


Hi @aniketkhadke,

This is a pretty common bottleneck when using Dataflow to write large volumes into Cloud SQL—especially if you're doing one insert per row. At 100M records, that’s going to crush the SQL CPU during the batch window.

  1. The BQ ➝ GCS ➝ Cloud SQL import approach is generally the most performant way to get large amounts of data into Cloud SQL. It bypasses the row-by-row processing bottleneck entirely and uses Cloud SQL's optimized bulk-loading mechanism. However, you're right that the Cloud SQL import API can only handle one file at a time; this is a known limitation of Cloud SQL (the import/export documentation lists the others). To scale this, you'd need to split the export into multiple files, then trigger parallel imports into staging tables or handle them sequentially with automation. A rough sketch of the sequential variant follows.
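
Here's a rough sketch of that flow using the Python clients (google-cloud-bigquery, google-cloud-storage, google-api-python-client). All project, bucket, instance, and table names below are placeholders, so treat it as a starting point rather than a drop-in script:

```python
# Sketch only: placeholder project/bucket/instance/table names throughout.
from google.cloud import bigquery, storage
from googleapiclient import discovery
import time

PROJECT   = "my-project"            # placeholder
BUCKET    = "my-export-bucket"      # placeholder
INSTANCE  = "my-cloudsql-instance"  # placeholder
DATABASE  = "my_db"                 # placeholder
SQL_TABLE = "target_table"          # placeholder

# 1) Export the BigQuery table to GCS. The '*' wildcard tells BigQuery to
#    shard the export into multiple files instead of one huge CSV.
bq = bigquery.Client(project=PROJECT)
bq.extract_table(
    "my-project.my_dataset.source_table",          # placeholder source table
    f"gs://{BUCKET}/export/shard-*.csv",
    job_config=bigquery.ExtractJobConfig(
        destination_format="CSV", print_header=False),
).result()                                          # wait for the export

# 2) Import each shard with the Cloud SQL Admin API. The import endpoint
#    accepts one file per call, so each shard is submitted separately and
#    polled to completion before the next one starts.
sqladmin = discovery.build("sqladmin", "v1beta4")
blobs = storage.Client(project=PROJECT).list_blobs(BUCKET, prefix="export/")

for blob in blobs:
    op = sqladmin.instances().import_(
        project=PROJECT,
        instance=INSTANCE,
        body={
            "importContext": {
                "fileType": "CSV",
                "uri": f"gs://{BUCKET}/{blob.name}",
                "database": DATABASE,
                "csvImportOptions": {"table": SQL_TABLE},
            }
        },
    ).execute()

    while True:                                     # wait for this import to finish
        status = sqladmin.operations().get(
            project=PROJECT, operation=op["name"]).execute()
        if status.get("status") == "DONE":
            break
        time.sleep(15)
```

Note that Cloud SQL runs only one import/export operation per instance at a time, which is why the sketch polls each operation before submitting the next shard.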

  2. Temporarily scaling Cloud SQL during the batch run is a solid middle ground. Instead of permanently increasing CPU/RAM (which isn't cost-effective), you can scale up just before your batch starts and scale back down afterward; see the sketch below.
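
If you go this route, the resize can be scripted around the batch window with the Cloud SQL Admin API. A minimal sketch with placeholder instance and tier names; keep in mind that changing the machine type restarts the instance, so there is a brief connection drop:

```python
# Sketch only: instance name and tiers are placeholders. Changing the tier
# restarts the instance, so run this outside peak serving hours.
from googleapiclient import discovery

PROJECT, INSTANCE = "my-project", "my-cloudsql-instance"   # placeholders
sqladmin = discovery.build("sqladmin", "v1beta4")


def set_tier(tier: str):
    """Patch only the machine tier; other instance settings are left as-is."""
    return sqladmin.instances().patch(
        project=PROJECT, instance=INSTANCE,
        body={"settings": {"tier": tier}},
    ).execute()


set_tier("db-custom-16-61440")   # scale up before the batch (16 vCPU / 60 GB)
# ... run the Dataflow pipeline / bulk import here ...
set_tier("db-custom-2-7680")     # scale back down afterwards (2 vCPU / 7.5 GB)
```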

  3. For optimizing the Dataflow inserts themselves, you may check the Dataflow and Cloud SQL best-practice guides. The main lever is reducing the number of INSERT statements, not just the commit size, by grouping rows in the pipeline and writing multi-row inserts; a sketch follows.
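
As a concrete illustration of that idea, here is a minimal sketch with the Beam Python SDK that batches rows and issues one multi-row INSERT per batch instead of one statement per record. It assumes a MySQL Cloud SQL instance reachable from the Dataflow workers, and the connection details, columns, and table name are placeholders; adapt the driver and SQL if you're on Postgres or SQL Server:

```python
# Sketch only: connection details, column list, and table name are placeholders.
import apache_beam as beam
import pymysql


class BatchedInsertFn(beam.DoFn):
    """Writes each incoming batch of rows as a single multi-row INSERT."""

    def setup(self):
        # One connection per worker, reused across bundles.
        self.conn = pymysql.connect(
            host="10.0.0.3", user="app", password="secret",   # placeholders
            database="my_db")

    def process(self, batch):
        values = ", ".join(["(%s, %s, %s)"] * len(batch))
        sql = f"INSERT INTO target_table (id, name, value) VALUES {values}"
        params = [field for row in batch for field in row]    # flatten the tuples
        with self.conn.cursor() as cur:
            cur.execute(sql, params)                           # one statement per batch
        self.conn.commit()

    def teardown(self):
        self.conn.close()


# Pipeline options (runner, temp_location, network, etc.) omitted for brevity.
with beam.Pipeline() as p:
    (p
     | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
           query="SELECT id, name, value FROM `my-project.my_dataset.source_table`",
           use_standard_sql=True)
     | "ToTuples" >> beam.Map(lambda r: (r["id"], r["name"], r["value"]))
     | "Batch" >> beam.BatchElements(min_batch_size=500, max_batch_size=1000)
     | "WriteBatches" >> beam.ParDo(BatchedInsertFn()))
```

Fewer, larger statements deliver the same rows with much less per-statement overhead on Cloud SQL; just keep the batch size small enough to stay under MySQL's max_allowed_packet.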

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

@mcbsalceda wrote:

To scale this, you’d need to split the export into multiple files then trigger parallel imports into staging tables or handle them sequentially with automation


Are you saying one staging table per CSV, with the imports running simultaneously? Is that supported in Cloud SQL?