
Replicate data from MySQL to BigQuery

I want to replicate data from MySQL to BigQuery using DataStream and Pub/Sub in the data transfer process. I aim to perform stream processing of the data from DataStream with a Cloud Function before replicating it into BigQuery. Can anyone suggest whether this is achievable without using GCS and Dataflow? I haven't found any implementation article on this.

MySQL ----> DataStream ----> Pub/Sub ----> Cloud Function ----> BigQuery
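For concreteness, the Cloud Function step in this flow might look roughly like the sketch below. The event shape and field names are assumptions for illustration only, since the exact payload DataStream would deliver via Pub/Sub is precisely the open question here:

```python
import base64
import json

def transform_event(pubsub_message: dict) -> dict:
    """Decode a (hypothetical) CDC event delivered via Pub/Sub and
    shape it into a row for BigQuery insertion.

    The message structure and field names here are assumptions for
    illustration; a real Datastream event would need to be inspected.
    """
    payload = json.loads(base64.b64decode(pubsub_message["data"]))
    # Keep only the fields we want to land in BigQuery.
    return {
        "id": payload["id"],
        "name": payload["name"],
        "_change_type": payload.get("op", "UPSERT"),
    }
```

In a real Cloud Function, the returned row could then be passed to `google.cloud.bigquery.Client.insert_rows_json` for the final load into BigQuery.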

1 ACCEPTED SOLUTION

Is your MySQL instance residing in Cloud SQL on GCP? Also, at the moment DataStream to Pub/Sub is not supported. You can track the progress in this issue tracker: https://issuetracker.google.com/278842503 and in this Google community post: https://www.googlecloudcommunity.com/gc/Serverless/Use-Pub-Sub-as-a-Datastream-destination/m-p/45423...

 

If you still need advice on organizing a process using Google Cloud, I would suggest contacting a TAM (Technical Account Manager) about your use case.




Great! Thank you for the solution. Can you also suggest if we can migrate specific rows from Cloud SQL to a BigQuery table using DataStream? Does DataStream provide functionality to filter out specific rows during migration?

 

Difficulty Connecting MySQL to BigQuery or Google Cloud Storage

Hi,

I'm currently facing challenges connecting MySQL with BigQuery. During the testing phase, I encountered the following issue:
The connection to the source could not be established.

I've tried various approaches, but I haven't been able to resolve the problem. Could you please provide guidance or share the best practices for this integration?

Any help would be greatly appreciated.

Thank you in advance!

Best regards,
Luan Silva.

Hi @Nikita_G, your proposed architecture:

MySQL → DataStream → Pub/Sub → Cloud Function → BigQuery

is definitely creative, but there are a few key things to keep in mind that might help you choose the best path forward:

Is that flow feasible without GCS and Dataflow?


Technically, it can be done, but here's the thing: DataStream isn't designed to publish directly to Pub/Sub.
Typically, DataStream writes to Cloud Storage or connects with Dataflow, which then loads the data into BigQuery.

Current limitations to be aware of:

  • There’s no native integration between DataStream and Pub/Sub.

  • You’d need to build a workaround to read files from GCS and forward them to Pub/Sub — which kind of brings GCS back into the picture anyway.
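A rough sketch of that workaround, assuming DataStream writes newline-delimited JSON to GCS (Avro is also an option). The file-parsing part is plain Python; the GCS trigger and Pub/Sub publish calls are left as comments since they need the client libraries and real credentials:

```python
import json

def file_to_messages(file_contents: str) -> list[bytes]:
    """Split one Datastream output file (assumed to be JSONL here)
    into individual Pub/Sub message payloads."""
    messages = []
    for line in file_contents.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        record = json.loads(line)          # one change event per line
        messages.append(json.dumps(record).encode("utf-8"))
    return messages

# In the actual Cloud Function (triggered on GCS "object finalized"):
#   from google.cloud import storage, pubsub_v1
#   text = storage.Client().bucket(bucket).blob(name).download_as_text()
#   publisher = pubsub_v1.PublisherClient()
#   for payload in file_to_messages(text):
#       publisher.publish(topic_path, payload)
```

Note that this still stores every file in GCS first, which is why the workaround doesn't really eliminate GCS from the architecture.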

Simpler (and recommended) alternatives:

Option 1: The classic GCP architecture


MySQL → DataStream → GCS → Dataflow → BigQuery

This is the most common and well-supported setup on GCP. It's scalable and reliable, though it may require a bit more initial setup.
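Google ships a Dataflow flex template for this exact path. A hedged launch sketch is below; the bucket names and dataset names are placeholders, and the parameter names should be verified against the current "Datastream to BigQuery" template documentation before use:

```shell
# Illustrative only -- verify template location and parameter names
# against the current Datastream-to-BigQuery template docs.
gcloud dataflow flex-template run datastream-to-bq \
  --region=us-central1 \
  --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Cloud_Datastream_to_BigQuery \
  --parameters \
inputFilePattern=gs://MY_DATASTREAM_BUCKET/output/,\
outputStagingDatasetTemplate=staging_dataset,\
outputDatasetTemplate=replica_dataset,\
deadLetterQueueDirectory=gs://MY_DLQ_BUCKET/dlq/
```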

Option 2: No-fuss ETL (no code, no GCS)


MySQL → Windsor.ai → BigQuery


With tools like Windsor.ai, you can connect MySQL as a data source and automatically send the data to BigQuery, without having to manage any infrastructure in between. You can also add basic transformation logic if needed.

Perfect if you're looking for a fully managed solution without building and maintaining custom pipelines.

Option 3: Custom CDC setup


MySQL → Debezium (CDC) → Pub/Sub → Cloud Function → BigQuery


This route gives you more flexibility and control, but you’d need to handle and maintain the components yourself.
It's ideal for hybrid environments or cases where you need precise control over change events.
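One way to wire the Debezium leg of this option without running Kafka is Debezium Server, which ships a Pub/Sub sink. A configuration sketch is below; hostnames, the GCP project, and table names are placeholders, and property names should be checked against the Debezium Server docs for your version:

```properties
# Sink: publish change events straight to Pub/Sub (no Kafka broker).
debezium.sink.type=pubsub
debezium.sink.pubsub.project.id=my-gcp-project

# Source: MySQL CDC via binlog.
debezium.source.connector.class=io.debezium.connector.mysql.MySqlConnector
debezium.source.database.hostname=mysql.internal
debezium.source.database.port=3306
debezium.source.database.user=debezium
debezium.source.database.password=CHANGE_ME
debezium.source.database.server.id=184054
debezium.source.topic.prefix=mysql-cdc
debezium.source.table.include.list=inventory.orders
```

From there, the Cloud Function subscribed to the topic handles the transform-and-load step into BigQuery.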

Final recommendation:


If you just need a daily or scheduled sync (not real-time), a tool like Windsor.ai can save you tons of time and configuration. But if real-time streaming and full control are what you’re after, then going with Debezium or the official GCP flow is your best bet.
