I'm new to data engineering and using MongoDB, so I'd like some help, please.
I have a MongoDB database that needs to be migrated to BigQuery. The goal is to:
I initially implemented the Dataflow batch MongoDB to BigQuery template, but it writes with WRITE_APPEND, and this left the target tables with millions of duplicate records.
Later I tried the MongoDB to BigQuery template (Stream), but it failed with this error:
Failed to read the result file : gs://dataflow-staging-southamerica-east1-287781428749/staging/template_launches/2025-04-22_06_30_53-16695916874292364341/operation_result with error message: Calling GetObjectMetadata with file "/bigstore/dataflow-staging-southamerica-east1-287781428749/staging/template_launches/2025-04-22_06_30_53-16695916874292364341/operation_result/": cloud.bigstore.ResponseCode.ErrorCode::OBJECT_NOT_FOUND: No such object: dataflow-staging-southamerica-east1-287781428749/staging/template_launches/2025-04-22_06_30_53-16695916874292364341/operation_result/ [google.rpc.error_details_ext] { message: "No such object: dataflow-staging-southamerica-east1-287781428749/staging/template_launches/2025-04-22_06_30_53-16695916874292364341/operation_result/" details { [type.googleapis.com/google.rpc.DebugInfo] { stack_entries: "com.google.net.rpc3.client.RpcClientException: APPLICATION_ERROR;cloud.bigstore (...)
Questions:
Any help is welcome, as I have been looking for solutions for a few weeks now.
Thank you.
Hi @rrotter, thanks for sharing your case. This is actually a pretty common scenario when working with MongoDB and BigQuery. Let me walk you through your questions step by step:
1. Does the streaming template need a connector?
Yes, the MongoDB to BigQuery (Stream) template requires a real-time change data source — like a MongoDB change stream. This means your MongoDB instance needs to be set up with change streams enabled (which requires a replica set), and Dataflow must be able to access that stream of changes.
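If it helps, here is a minimal sketch for confirming that change streams actually work on your cluster, using the Python driver; the connection URI, database, and collection names are placeholders. Opening a change stream fails right away on a standalone instance, so this is a quick way to verify the replica-set requirement:

```python
# Quick check that MongoDB change streams are available (requires a replica set).
# The connection URI, database, and collection names are placeholders.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongodb://user:password@your-mongo-host:27017/?replicaSet=rs0")
collection = client["your_database"]["your_collection"]

try:
    # watch() opens a change stream; it raises an error on standalone instances.
    with collection.watch() as stream:
        print("Change streams are enabled; waiting for one change event...")
        print(next(stream))  # blocks until a change arrives
except PyMongoError as exc:
    print(f"Change streams are not available here: {exc}")
```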
2. What’s causing the "No such object..." error?
That error usually happens due to a failure during pipeline initialization or due to permission issues with the Dataflow staging bucket. A few things to check (a quick permission check follows this list):
Make sure the service account has Storage Object Viewer and Storage Object Creator roles for the dataflow-staging-* bucket.
Ensure the bucket exists and hasn’t been recently deleted or modified.
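If you want to rule out the permission issue quickly, here is a rough sketch (the bucket name is a placeholder, and it assumes you run it with the same service account the Dataflow job uses) that checks the staging bucket exists and that the account can both create and read objects, which is roughly what the Storage Object Creator and Storage Object Viewer roles grant:

```python
# Rough check that the Dataflow staging bucket exists and is readable/writable
# by the credentials this runs with. The bucket name is a placeholder.
from google.cloud import storage

BUCKET_NAME = "dataflow-staging-southamerica-east1-XXXXXXXXXXXX"  # placeholder

client = storage.Client()
bucket = client.lookup_bucket(BUCKET_NAME)
if bucket is None:
    raise SystemExit(f"Bucket {BUCKET_NAME} does not exist or is not visible to this account.")

# Write and read back a tiny object to confirm create + view permissions.
blob = bucket.blob("permission_check/_probe.txt")
blob.upload_from_string("probe")   # needs storage.objects.create (Object Creator)
print(blob.download_as_text())     # needs storage.objects.get (Object Viewer)
# Note: deleting the probe needs storage.objects.delete, which these two roles do not include.
```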
3. Does the template use WRITE_APPEND?
Yes, both the batch and streaming templates use WRITE_APPEND by default. So if you want to avoid duplicates, you'll need to implement deduplication logic yourself: either directly in the pipeline (e.g., grouping records by a key before the BigQueryIO write) or by writing to a staging table and running a MERGE statement in BigQuery, as in the sketch below.
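For the staging-table approach, a minimal sketch using the BigQuery Python client; the project, dataset, and table names, the `_id` key, and the other column names are assumptions about your schema, so adjust them to whatever the pipeline actually writes:

```python
# Deduplicate by loading new data into a staging table first, then MERGE into the target.
# Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `your-project.your_dataset.target_table` AS t
USING `your-project.your_dataset.staging_table` AS s
ON t._id = s._id
WHEN MATCHED THEN
  UPDATE SET t.payload = s.payload, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (_id, payload, updated_at)
  VALUES (s._id, s.payload, s.updated_at)
"""

client.query(merge_sql).result()  # waits for the MERGE job to finish
print("Staging rows folded into the target table without duplicates.")
```

After the MERGE you can truncate the staging table so the next load starts clean.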
4. Is using Debezium a good alternative?
Absolutely. Debezium with Kafka Connect is a solid choice, especially if you already have the infrastructure or can spin it up with Docker or Kubernetes. It lets you capture CDC (change data capture) events from MongoDB and send them to a Kafka topic, which you can then consume and write to BigQuery with custom logic (e.g., with deduplication via MERGE), as in the sketch below.
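If you go down that road, the consumer side can stay fairly small. Here is a rough sketch (bootstrap servers, topic name, consumer group, and table ID are placeholders, and it assumes Debezium's default JSON envelope) that reads change events from Kafka and streams them into a BigQuery staging table, which you would then fold into the target with a MERGE like the one above:

```python
# Consume Debezium CDC events from Kafka and stream them into a BigQuery staging table.
# Bootstrap servers, topic, group id, and table id are placeholders.
import json

from confluent_kafka import Consumer
from google.cloud import bigquery

bq = bigquery.Client()
TABLE_ID = "your-project.your_dataset.staging_table"  # placeholder

consumer = Consumer({
    "bootstrap.servers": "kafka-broker:9092",
    "group.id": "mongo-to-bigquery",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["dbserver1.your_database.your_collection"])  # Debezium topic name (placeholder)

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Debezium may wrap the document in a schema envelope; "after" holds the new state.
        payload = event.get("payload", event)
        after = payload.get("after")
        if after is None:
            continue  # deletes/tombstones would need separate handling
        doc = json.loads(after) if isinstance(after, str) else after
        # Assumes the staging table schema matches the document's fields.
        errors = bq.insert_rows_json(TABLE_ID, [doc])
        if errors:
            print("BigQuery insert errors:", errors)
finally:
    consumer.close()
```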
5. Is there a simpler, faster way to do this?
If you're looking for something quicker to set up and less operationally heavy, you could consider a solution like Windsor.ai. It can connect MongoDB to BigQuery, handle both the initial load and incremental updates, and gives you control over deduplication — without the need to manage Kafka or complex Dataflow pipelines.
Hope this helps!
Hi, @Snoshone07 ! 😀
Thank you very much for your support.
I'll study these possibilities to decide on the best way to handle this requirement.
Best regards!