Hello Google Cloud Community,
I am working on migrating an app from AWS to GCP.
The main objective is a real-time dashboard system that operates within a 7-15 second latency window from data event (Kafka) to visualization (custom app), providing up-to-date data, insights, and analytics.
Currently, the app:
- Reads CDC data (inserts, updates, deletes) for N operational tables from a Kafka topic.
- Writes to N Amazon Aurora DB tables.
- A microservice pulls data from Aurora every 7 seconds and pushes it to Redis, which is connected in real time to the visualization layer (custom app). It is business-critical that data freshness never exceeds 15 seconds. (A simplified sketch of this polling step follows the list.)
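For context, the polling step is conceptually like the sketch below. This is a simplified illustration only: connection details, table and key names are placeholders rather than our real schema, and it assumes the Aurora MySQL-compatible flavor.

```python
# Hypothetical sketch of the current polling microservice: every 7 seconds it
# reads the latest rows from Aurora and pushes a JSON snapshot into Redis.
# Connection details, table names, and keys are illustrative placeholders.
import json
import time

import pymysql  # assumes Aurora MySQL-compatible; a Postgres driver would be used otherwise
import redis

POLL_INTERVAL_SECONDS = 7

aurora = pymysql.connect(
    host="aurora-endpoint", user="app", password="***", database="dashboard",
    cursorclass=pymysql.cursors.DictCursor,
)
cache = redis.Redis(host="redis-endpoint", port=6379)

def poll_once() -> None:
    """Read the rows the dashboard needs and publish them to Redis."""
    with aurora.cursor() as cur:
        # Placeholder query -- the real service reads from N tables.
        cur.execute("SELECT * FROM dashboard_metrics ORDER BY updated_at DESC LIMIT 1000")
        rows = cur.fetchall()
    # One key per dashboard dataset; the visualization layer reads this in real time.
    cache.set("dashboard:metrics", json.dumps(rows, default=str))

while True:
    started = time.monotonic()
    poll_once()
    # Sleep the remainder of the 7 s window so total freshness stays under 15 s.
    time.sleep(max(0.0, POLL_INTERVAL_SECONDS - (time.monotonic() - started)))
```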
The current setup is having performance issues and rising costs, so we want to migrate to a new architecture within GCP.
I would appreciate any insights or suggestions on the feasibility and optimization of the following proposed architecture using Google Cloud Platform services. We are concerned about latency and costs:
Architecture Overview:
Architecture Overview alternative B:
Same as before, but using Dataflow to model the data with an append-only strategy. Example:
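A minimal Beam sketch of the idea we have in mind (assuming JSON-encoded CDC events; topic, project, dataset, and schema names below are placeholders, not our actual design):

```python
# Minimal Apache Beam (Dataflow) sketch of the append-only idea: every CDC event
# from Kafka is appended as a new BigQuery row (with op type and event time),
# instead of mutating existing rows. Topic, project, dataset, and schema are
# placeholders, and the payload is assumed to be JSON -- adjust for your format.
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

def to_bq_row(kafka_record):
    """Decode a Kafka (key, value) pair into a flat dict for BigQuery."""
    _key, value = kafka_record
    event = json.loads(value.decode("utf-8"))
    return {
        "op": event["op"],                 # I / U / D
        "source_table": event["table"],
        "event_ts": event["ts"],
        "payload": json.dumps(event["data"]),
    }

options = PipelineOptions(streaming=True)  # plus your Dataflow runner options

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["cdc-topic"],
        )
        | "ToRow" >> beam.Map(to_bq_row)
        | "AppendToBQ" >> beam.io.WriteToBigQuery(
            table="my-project:analytics.cdc_events",
            schema="op:STRING,source_table:STRING,event_ts:TIMESTAMP,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        )
    )
```

The point is that CDC events are only ever appended (with op type and event time); the dashboard queries then derive the latest state, rather than updating rows in place.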
Key Requirements:
I am particularly interested in understanding:
Your expertise and any case studies or examples of similar implementations would be greatly appreciated.
Thank you in advance for your insights and assistance!
It sounds like you are trying to roll-your-own CDC processing system. While this can be done, I wonder if there is an opportunity here for you to leverage an existing CDC processing engine? How strict is the requirement to read CDC messages from Kafka? How are they being put into the Kafka topic today? Do you already have a CDC engine and does it have BigQuery integration ... (eg. source database -> CDC product XYZ (eg. Google Datastream) -> BigQuery)? What is the volume of CDC messages per period of time? What is the format of the CDC messages?
Perhaps the notion that these are CDC messages is unimportant in your story. I was assuming that the CDC messages were used to reconcile a BigQuery table and keep it up to date, but maybe these are "just records" to you and you ONLY want to append them to a table?
As you sense ... there are quite a few options/permutations, and data pipelines to ingest data are something that MOST Google Cloud customers need, hence Google has a ton of experience with them. Does your enterprise already have a relationship with Google? I'd suggest you contact your Google account representative ... they can likely arrange a call for you with a customer engineer or architect to holistically look at your needs and make specific recommendations.
However ... to answer your specific questions ... Yes ... Dataflow can consume from Kafka and Yes ... Dataflow can insert into BigQuery ... all with minimal latency and with horizontal scaling to accommodate even the highest volumes. Receiving a fan-in of records sourced from multiple upstream tables and concurrently writing to many BigQuery downstream tables is not an issue at all for a Dataflow pipeline.
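For example, the Python SDK's WriteToBigQuery accepts a callable for the destination table, so a single pipeline can route each record to a per-source-table destination. The project/dataset names and row shape below are placeholder assumptions, matching the append-only rows sketched earlier in the thread:

```python
# A sketch of fanning records out to many BigQuery tables from one Dataflow
# pipeline: `table` is a callable, so each element chooses its destination.
# Project/dataset names and the row shape are placeholder assumptions.
import apache_beam as beam

def route_to_table(row):
    # e.g. a CDC record from upstream table "orders" lands in analytics.cdc_orders
    return f"my-project:analytics.cdc_{row['source_table']}"

fan_out_write = beam.io.WriteToBigQuery(
    table=route_to_table,  # one write transform, N destination tables
    schema="op:STRING,source_table:STRING,event_ts:TIMESTAMP,payload:STRING",
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
)
```

You would plug a transform like that in as the final step of a streaming pipeline that reads from Kafka.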