
Dataflow template MongoDB to BigQuery (CDC)

Hi

I am using the MongoDB to BigQuery CDC template and I am able to create a job successfully.

However, when a MongoDB document has a new field, the schema to write to BigQuery changes and the pipeline fails. How do I create a pipeline that adjusts the schema it uses and keeps writing to the same BigQuery table?

Also, I have to keep a Python script running on a VM to do the streaming, and I find it cumbersome because everything stops when the VM is shut down. Is there an alternative way to run the script that streams data from MongoDB, so it is not separate from the Dataflow pipeline? For example, could the stream script that sends messages to Pub/Sub be provided as part of the container Google creates to launch my Dataflow pipeline?

I would really appreciate your help solving this issue; I have already spent many days on it.
Thanks in advance.

2 REPLIES

For the first issue where the MongoDB document schema changes and causes the pipeline to fail:

BigQuery has a feature called schema auto-detection. When you load data into BigQuery and you don't know the data's schema, you can enable automatic schema detection and BigQuery will make a best-effort attempt to infer it. Be aware that auto-detection has limitations: it applies to load jobs and external tables rather than streaming inserts, and it infers types from a sample of the rows, so it may not always choose the types you expect. Combined with the ALLOW_FIELD_ADDITION schema update option, a load job can also add new nullable columns to an existing table instead of failing.
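As a concrete illustration (not the template's own code), here is a minimal sketch using the Python BigQuery client that batch-loads JSON documents with auto-detection plus ALLOW_FIELD_ADDITION, so a previously unseen field becomes a new nullable column instead of a failed load. The project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.mongo_events"  # placeholder table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer types for fields BigQuery has not seen before
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Relax the schema: new columns are added instead of failing the load.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

# Example documents; "new_field" does not exist in the table yet.
rows = [{"_id": "abc123", "status": "active", "new_field": "value"}]
client.load_table_from_json(rows, table_id, job_config=job_config).result()
```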

Another way to handle this is a flexible schema design in BigQuery, where the table has a JSON or nested/repeated field that can absorb extra fields from the MongoDB documents instead of mapping every field to its own column.
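For instance (a sketch only; the table and column names here are made up, and this is not the template's output format), you could keep the entire document in a single JSON column and extract individual fields at query time:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table that stores the whole MongoDB document in one JSON column,
# so new fields in the source never require a schema change.
table_id = "my-project.my_dataset.mongo_raw"  # placeholder table
schema = [
    bigquery.SchemaField("_id", "STRING"),
    bigquery.SchemaField("ingested_at", "TIMESTAMP"),
    bigquery.SchemaField("document", "JSON"),
]
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)

# Individual fields are then extracted at query time, for example:
#   SELECT JSON_VALUE(document, '$.customer.email')
#   FROM `my-project.my_dataset.mongo_raw`
```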

When designing your pipeline, you should also consider adding error handling to ensure that when a new field is added to the MongoDB document, it doesn't cause the pipeline to fail. For example, you could catch the error, log it, and send the document with the new field to a dead-letter queue for later analysis and processing.
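A rough sketch of that pattern in an Apache Beam (Python) pipeline is shown below; ParseDocument, the tag name, and the output destinations are illustrative rather than part of the template:

```python
import json
import logging

import apache_beam as beam

DEAD_LETTER_TAG = "dead_letter"


class ParseDocument(beam.DoFn):
    """Parses a change-stream message and routes bad records to a dead-letter output."""

    def process(self, message):
        try:
            row = json.loads(message)
            # ...map the document to the row shape the BigQuery table expects...
            yield row
        except Exception as err:  # broad catch so one bad document never fails the job
            logging.warning("Dead-lettering message: %s", err)
            yield beam.pvalue.TaggedOutput(DEAD_LETTER_TAG, message)


def route(messages):
    results = messages | "Parse" >> beam.ParDo(ParseDocument()).with_outputs(
        DEAD_LETTER_TAG, main="rows"
    )
    # results.rows        -> write to the main BigQuery table
    # results.dead_letter -> write to a dead-letter table or Pub/Sub topic for later review
    return results.rows, results[DEAD_LETTER_TAG]
```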

For the second issue, regarding the Python script running on a VM:

Instead of running your script on a VM that you have to manually manage, you could consider using a serverless solution. Google Cloud offers a few different options for this:

  1. Google Cloud Functions: This lets you write single-purpose functions that run in response to events or HTTP requests. MongoDB is not a native Cloud Functions trigger, so you would typically expose an HTTP-triggered function that something on the MongoDB side (for example, an Atlas Trigger) calls whenever new data is written, and have the function forward that data to your Pub/Sub topic.

  2. Google Cloud Run: This is a managed compute platform that automatically scales your stateless containers. If your script can be packaged into a Docker container, Cloud Run could be a good solution (a sketch of such a containerized change-stream publisher follows this list).

  3. Google Kubernetes Engine (GKE): If your workload is more complex, and you need more control and flexibility, you might consider using GKE. This requires more setup and management, but it gives you a lot of flexibility.
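For example, here is a minimal sketch of such a containerized change-stream publisher. This is an assumption about what your script does, not code from the template, and the connection string, database, collection, project, and topic names are all placeholders:

```python
import json
import os

from google.cloud import pubsub_v1
from pymongo import MongoClient

# Hypothetical configuration supplied via environment variables.
MONGO_URI = os.environ["MONGO_URI"]
PROJECT_ID = os.environ["PROJECT_ID"]
TOPIC_ID = os.environ["TOPIC_ID"]


def main():
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
    collection = MongoClient(MONGO_URI)["my_db"]["my_collection"]  # placeholder names

    # Watch the MongoDB change stream and forward each change to Pub/Sub.
    # full_document="updateLookup" includes the current document on updates;
    # delete events carry no fullDocument, so they are skipped here.
    with collection.watch(full_document="updateLookup") as stream:
        for change in stream:
            doc = change.get("fullDocument")
            if doc is None:
                continue
            payload = json.dumps(doc, default=str).encode("utf-8")
            publisher.publish(topic_path, payload).result()


if __name__ == "__main__":
    main()
```

Packaged into a container image, something like this could run on Cloud Run with CPU always allocated, or on GKE as a long-running workload.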

For Cloud Functions, and for Cloud Run services that only allocate CPU while handling a request, you would need to modify your script so it is triggered by an event or a request rather than running continuously. On GKE (or Cloud Run configured with CPU always allocated) it can instead keep running as a long-lived change-stream listener.

Finally, you could also consider writing a custom Dataflow pipeline instead of relying solely on the template; Dataflow is designed for running large-scale batch and streaming data processing pipelines. This would probably require a more significant refactor of your code, but it could be a good long-term solution if you anticipate needing to process large amounts of data.
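If you do go that route, the streaming part of a custom pipeline can be quite small. A minimal sketch, assuming the change events arrive on a Pub/Sub topic as JSON and the destination table already exists (the project, topic, and table names are placeholders):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/mongo-changes"  # placeholder
            )
            | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.mongo_events",  # placeholder
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```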

I have not found a solution yet.