Building a realtime data ingestion pipeline from MongoDB to BigQuery

RC1
Bronze 4

Hi folks, I want to build a realtime CDC (change data capture) pipeline that captures change events from MongoDB databases and dumps them into BigQuery in a flat format. How do I design this kind of architecture using the various GCP services? How do you build an ETL pipeline from MongoDB to Google BigQuery? I am mostly stuck at the data ingestion part, because as of now there is no GCP service that does realtime MongoDB CDC ingestion. Any help is highly appreciated.
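Since the goal is to land the data in BigQuery "in flat format", one step any pipeline here will need is flattening nested MongoDB documents into single-level rows. A minimal sketch (the separator and the sample document are just illustrative assumptions):

```python
def flatten(doc, parent_key="", sep="_"):
    """Recursively flatten a nested MongoDB document into a single-level
    dict, joining nested keys with `sep` so it maps onto flat BigQuery columns.
    Note: arrays are left as-is here; handling them is a schema decision."""
    flat = {}
    for key, value in doc.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# Example: a simplified change-stream `fullDocument`
event = {"_id": "abc", "customer": {"name": "Ada", "address": {"city": "Pune"}}, "total": 42}
row = flatten(event)
# row == {"_id": "abc", "customer_name": "Ada", "customer_address_city": "Pune", "total": 42}
```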

Solved
1 ACCEPTED SOLUTION

RC1
Bronze 4

@kylemurley 

I went through this solution last week. The problem here is how to bring full load + CDC data from MongoDB to Pub/Sub or GCS at scale. Debezium is one solution, but we are looking for a managed service where scaling is handled for us, because our MongoDB databases are around 100 GB in size. My only hurdle is finding a managed or scalable service that can capture the full load + CDC data from Mongo and dump it to GCS or Pub/Sub. After that I can handle the data using Dataproc and other ETL services.
I personally think this is a missing piece that GCP needs to implement so that it can serve different data platform / data analytics use cases.
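For anyone in the same spot: until a managed connector exists, the self-managed version of the "Mongo → Pub/Sub" leg can be a small process that tails a MongoDB change stream and publishes each event. A sketch under assumptions — the project, topic, database, and collection names are placeholders, and it requires `pymongo`, `google-cloud-pubsub`, and a MongoDB replica set (change streams need one):

```python
import json

def change_event_to_message(event):
    """Turn a MongoDB change-stream event into Pub/Sub message bytes.
    Keeps only the fields needed downstream; `default=str` handles
    BSON types (ObjectId, datetime) that json can't serialize natively."""
    payload = {
        "operationType": event.get("operationType"),
        "ns": event.get("ns"),
        "documentKey": event.get("documentKey"),
        "fullDocument": event.get("fullDocument"),
    }
    return json.dumps(payload, default=str).encode("utf-8")

if __name__ == "__main__":
    # Hypothetical wiring, not a production consumer: no resume-token
    # checkpointing or error handling, so restarts lose position.
    from pymongo import MongoClient
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path("my-project", "mongo-cdc")  # placeholder names
    client = MongoClient("mongodb://localhost:27017")
    with client["mydb"]["orders"].watch(full_document="updateLookup") as stream:
        for event in stream:
            publisher.publish(topic, change_event_to_message(event))
```

This is exactly the part Debezium would manage for you (resume tokens, snapshots, scaling), which is why a managed offering matters for the full-load case.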


3 REPLIES

Hi RC1, I was just having this conversation with a customer of mine.

These are a few resources I shared. Data Fusion and Dataflow have MongoDB integrations. There are also ways to accomplish this with third-party tools, but these are GCP and MongoDB first-party offerings that you may wish to look into. Best wishes, -Kyle Murley, Google Cloud Customer Engineer, San Diego, CA, US.


Hi @RC1, BigQuery now supports CDC. Please check the release notes here:

https://cloud.google.com/bigquery/docs/release-notes