
Feasibility of a Real-Time Analytics Dashboard Architecture on GCP

Hello Google Cloud Community,

I am working on migrating an app from AWS to GCP.

The main objective is to design a real-time dashboard system that operates within a latency window of 7-15 seconds from data event (Kafka) to visualization (custom app), providing up-to-date data, insights, and analytics.

Currently, the app:

- Reads CDC data (inserts, updates, deletes) for N operational tables from a Kafka topic.
- Writes to N Amazon Aurora DB tables.
- A microservice pulls data from Aurora every 7 seconds and pushes it to Redis, which feeds the visualization layer (custom app) in real time. It is business-critical that data freshness never exceeds 15 seconds.

The current setup is running into performance issues and rising costs, so we want to migrate to a new architecture within GCP.

I would appreciate any insights or suggestions on the feasibility and optimization of the following proposed architecture using Google Cloud Platform services. We are concerned about latency and costs:

Architecture Overview:

  • Source: CDC from Kafka (Events):
    • Events are streamed from Kafka to Google Dataflow, or perhaps directly to BigQuery using the Kafka-to-BigQuery connector.
  • BigQuery as Primary Storage:
    • Live Table: For storing recent data, receiving change events (inserts, updates, and deletes) in real time.
    • Historical Table: For maintaining historical data, updated (MERGE) every 5 minutes.
    • Unified View: Combining data from both Live and Historical tables to provide a real-time comprehensive view.
  • Backend as a Microservice:
    • Performing queries against the Unified View in BigQuery every 7 seconds (rough sketch after this list).
    • Using Redis for caching to optimize performance and reduce latency.
    • Pushing processed data to the visualization layer.
  • Visualization Layer (Dashboard Application):
    • Receiving and displaying data from the backend service, ensuring real-time data representation.
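
For reference, the backend microservice we have in mind would be roughly like the sketch below. The project, view, column, and Redis key names are placeholders, and the unified view is assumed to already deduplicate by primary key and CDC timestamp:

```python
import json
import time

import redis
from google.cloud import bigquery

BQ_PROJECT = "my-project"                 # placeholder
UNIFIED_VIEW = "analytics.unified_view"   # placeholder: Live UNION Historical, deduplicated
POLL_INTERVAL_SECONDS = 7
CACHE_KEY = "dashboard:latest"            # placeholder Redis key
CACHE_TTL_SECONDS = 15                    # matches the 15-second freshness budget

bq = bigquery.Client(project=BQ_PROJECT)
cache = redis.Redis(host="localhost", port=6379)

# event_ts is a placeholder column; the filter just keeps the query cheap.
QUERY = f"""
SELECT *
FROM `{BQ_PROJECT}.{UNIFIED_VIEW}`
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""

def refresh_cache() -> None:
    """Query the unified view and push the result snapshot into Redis."""
    rows = [dict(row.items()) for row in bq.query(QUERY).result()]
    cache.set(CACHE_KEY, json.dumps(rows, default=str), ex=CACHE_TTL_SECONDS)

if __name__ == "__main__":
    while True:
        started = time.monotonic()
        refresh_cache()
        # Sleep whatever is left of the 7-second polling window.
        time.sleep(max(0.0, POLL_INTERVAL_SECONDS - (time.monotonic() - started)))
```

The visualization layer would then read only from Redis, so BigQuery sees one query per polling interval rather than one per dashboard user.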

Architecture Overview, Alternative B:

Same as before, but using Dataflow to model the data with an append-only strategy. Example:

  • Table A in source
  • CDC -> Kafka
  • Kafka -> Dataflow (splits the data into fact and dimension tables in real time, using an append-only strategy)
  • The app queries the data model using the timestamp and primary key to get the latest true version of each row (see the sketch after this list).
  • The rest of the system is the same.
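
To make the "latest true" lookup concrete, something along these lines is what we mean. The table name and the pk, change_ts, and op columns are placeholder names for the append-only change table written by Dataflow:

```python
from google.cloud import bigquery

bq = bigquery.Client()

# Keep only the most recent change per primary key, dropping keys whose
# latest change is a delete.
LATEST_TRUE_SQL = """
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY pk ORDER BY change_ts DESC) AS rn
  FROM `my-project.analytics.fact_orders_changes`   -- placeholder append-only table
)
WHERE rn = 1
  AND op != 'DELETE'
"""

latest_rows = list(bq.query(LATEST_TRUE_SQL).result())
```

The same pattern could back the Unified View in the first design, with the Historical table taking over rows older than the MERGE window.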

Key Requirements:

  • Maintain a latency of 7-15 seconds from data ingestion to visualization.
  • Ensure scalability as the system grows to handle more than 1 million daily queries and increasing data volume.
  • Optimize costs associated with BigQuery.

 

I am particularly interested in understanding:

  • Whether the proposed architecture is viable for the latency requirements, the number of tables (more than 30, all changing all the time), and the real-time analytics use case.
  • Whether the proposed architecture can scale to handle automatic (or manual) confirmation of operations (for example, adding a real-time-critical Cloud SQL instance to the equation).
  • Best practices for optimizing each component of the architecture to maintain or improve latency.
  • Can Dataflow handle streaming inserts into multiple tables?
  • Can multiple Dataflow jobs insert into the same table?
  • Should we consider another tool, such as AlloyDB, as the backend?
  • Any potential pitfalls or considerations I should be aware of when implementing this architecture on GCP.

Your expertise and any case studies or examples of similar implementations would be greatly appreciated.

Thank you in advance for your insights and assistance!

1 REPLY

It sounds like you are trying to roll your own CDC processing system. While this can be done, I wonder if there is an opportunity here for you to leverage an existing CDC processing engine? How strict is the requirement to read CDC messages from Kafka? How are they being put into the Kafka topic today? Do you already have a CDC engine, and does it have BigQuery integration ... (e.g. source database -> CDC product XYZ (e.g. Google Datastream) -> BigQuery)? What is the volume of CDC messages per period of time? What is the format of the CDC messages?

Perhaps the notion that these are CDC messages is unimportant in your story. I was assuming that the CDC messages were used to reconcile a BigQuery table and keep it up to date, but maybe these are "just records" to you and you ONLY want to append them to a table?

As you sense ... there are quite a few options/permutations, and data-ingestion pipelines are something that MOST Google Cloud consumers need, so Google has a ton of experience with them. Does your enterprise already have a relationship with Google? I'd suggest you contact your Google account representative ... they can likely arrange a call with a customer engineer or architect to look holistically at your needs and make specific recommendations.

However ... to answer your specific questions ... Yes ... Dataflow can consume from Kafka and Yes ... Dataflow can insert into BigQuery ... all with minimal latency and with horizontal scaling to accommodate even the highest volumes.  Receiving a fan-in of records from sourced from multipled upstream tables and concurrently writing to many BigQuery downstream tables is not an issue at all for a Dataflow pipeline.