Do we really need Dataflow for stream processing?

This is somewhat of an open-ended question, but I am trying to understand whether we really need Dataflow or an equivalent technology like Apache Flink for stream processing. Our problem: we are a large-scale enterprise company. We need to process sales orders in real time to compute various financial metrics. Our data volume is what I would call medium; we certainly don't receive billions or even millions of events per second. Our challenge is that the data model is screwed up. Data is scattered across many tables, and the information is imperfect. To process a sales order we have to look up many tables and apply complex rules to fill in the information we don't have. Is Dataflow really the right choice for this use case? These days, whenever I hear "stream processing", the next thing I hear is Dataflow, Flink, Kafka (KSQL), etc.


I think we'll need to paint a broader picture. I'm hearing you say that you do indeed have events arriving. Are these events in Kafka, Pub/Sub, or some other messaging system? Are they incoming REST requests? Are they micro-batches of files or direct database inserts? All of these will influence our discussion. Next comes the question of what processing has to be performed on an event when it arrives. I think I'm hearing that the event payload is going to be replicated or distributed across many tables. What is the nature of the back-end database? Is it BigQuery, Cloud SQL, or something else?

Now let's talk in general terms ... to my ears, stream processing is the ingestion of data, its processing as part of a pipeline, and then its disposition to live in a particular format at rest at the end. Many customers want as low a latency as possible from the time an event arrives at the enterprise to when it can be of value downstream. If we didn't use a stream processing engine, what would be the alternative? We could (I think) let the ingested events languish in a file or other storage medium when they arrive and then batch process them. This would maximize latency. Alternatively, we could write ourselves an application that receives the incoming events, processes them, and deposits them ... but if we wrote this application from scratch, we would effectively be re-creating what a stream processing engine provides today. I see a stream processing engine as a "platform" that you can use as a significant starting point for building stream processing solutions.

As for choosing which stream engine to use, you do indeed have a variety of options ... many of which can run on Google Cloud. The default story from Google is Dataflow. Let's realize that Dataflow is a Google marketing name for "managed Apache Beam". The skills and techniques needed to build a stream pipeline are 100% open-source Apache Beam. Dataflow is Google's serverless environment for getting your Apache Beam job running with as little fuss as possible, with autoscaling and other useful features.
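To make "managed Apache Beam" a little more concrete, here is a minimal sketch of what a streaming Beam pipeline looks like in Python. The Pub/Sub topic, BigQuery table, and field names are made-up placeholders rather than anything specific to your environment; the same code would run on Dataflow simply by submitting it with the Dataflow runner.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Streaming pipeline: Pub/Sub -> parse -> 1-minute windows -> revenue per region -> BigQuery.
options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner (plus project/region) to run on Dataflow

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadOrders" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/sales-orders")
        | "Parse" >> beam.Map(json.loads)
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 1-minute event-time windows
        | "KeyByRegion" >> beam.Map(lambda order: (order["region"], order["amount"]))
        | "SumRevenue" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"region": kv[0], "revenue": kv[1]})
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:finance.revenue_by_region",
            schema="region:STRING, revenue:FLOAT",
        )
    )
```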

Let's turn our conversation over to "If not stream processing, what is the alternative?".  Looking forward to hearing back.

Thanks for the answer. 

Are these events in Kafka, Pub/Sub, or some other messaging system?

> We have not decided between Kafka and Pub/Sub. Let's assume it's Pub/Sub, to keep the discussion focused on GCP, which is what this forum is about.

Are they incoming REST requests?

> Yes, we can set up a Pub/Sub push subscription, if that is what the question is asking.
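(For reference, creating a push subscription is only a few lines against the Pub/Sub client library; the project, topic, and endpoint below are placeholders. Note that a Beam/Dataflow pipeline would normally read from a pull subscription instead, so push only matters if something other than the pipeline is receiving the events.)

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project", "sales-orders")
subscription_path = subscriber.subscription_path("my-project", "sales-orders-push")

# Deliver each message as an HTTPS POST to the given endpoint.
push_config = pubsub_v1.types.PushConfig(push_endpoint="https://orders.example.com/ingest")

with subscriber:
    subscriber.create_subscription(
        request={"name": subscription_path, "topic": topic_path, "push_config": push_config}
    )
```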

Are they micro-batches of files or direct database inserts?

> Not sure what is meant here.

All of these will influence our discussion. Next comes the question of what processing has to be performed on an event when it arrives.

> I think this is the part we are confused about and where we need guidance. Our event processing is complex. Characteristics of our problem:

1. Not all data is in one stream. We may need to combine and merge streams. For this, I understand Dataflow SQL could be a great choice: it would merge the streams and output a derived stream (a rough sketch of what I mean follows after this list).

2. After that, to continue the processing we may have to look up data in many tables (we haven't decided which database we would use). The question I asked on the forum stemmed from this. I understand Dataflow is a good choice when:

2a. the processing is confined to streaming data 

2b. the processing logic outputs data that is a continuous function of the input within a time window, much like digital signal processing, for those familiar with it.

Is Dataflow a good choice when the processing logic has to look up data in MySQL or Postgres and update those tables based on the input? E.g., say the input is sales orders and MySQL stores some statistics such as revenue. We need to update this table, and the input we get is not perfect; we have to do a lot of "sweeping the floor" to fill in the missing information.
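Roughly, what I have in mind for points 1 and 2 would look something like this in Beam (Python). Two Pub/Sub streams are windowed, joined on an order key, and each joined record is enriched by querying Postgres for the information missing from the event. All topic names, keys, table names, and the query are made up for illustration:

```python
import json
import os

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


class EnrichFromPostgres(beam.DoFn):
    """Fills in fields missing from the event by querying a reference table."""

    def setup(self):
        import psycopg2  # hypothetical driver choice; one connection per worker
        self.conn = psycopg2.connect(
            host="db-host", dbname="sales", user="app", password=os.environ["DB_PASSWORD"]
        )

    def process(self, element):
        order_id, grouped = element  # grouped = {"orders": [...], "payments": [...]}
        customer_id = grouped["orders"][0]["customer_id"] if grouped["orders"] else None
        tier = "unknown"
        if customer_id is not None:
            with self.conn.cursor() as cur:
                cur.execute("SELECT tier FROM customers WHERE id = %s", (customer_id,))
                row = cur.fetchone()
                if row:
                    tier = row[0]
        yield {
            "order_id": order_id,
            "orders": grouped["orders"],
            "payments": grouped["payments"],
            "customer_tier": tier,
        }

    def teardown(self):
        self.conn.close()


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:

    def read_keyed(label, topic):
        # Read a Pub/Sub topic, parse JSON, key by order_id, window into 5-minute windows.
        return (
            p
            | f"Read{label}" >> beam.io.ReadFromPubSub(topic=topic)
            | f"Parse{label}" >> beam.Map(json.loads)
            | f"Key{label}" >> beam.Map(lambda e: (e["order_id"], e))
            | f"Window{label}" >> beam.WindowInto(FixedWindows(300))
        )

    orders = read_keyed("Orders", "projects/my-project/topics/orders")
    payments = read_keyed("Payments", "projects/my-project/topics/payments")

    enriched = (
        {"orders": orders, "payments": payments}
        | "JoinOnOrderId" >> beam.CoGroupByKey()   # merge the two streams per window
        | "Enrich" >> beam.ParDo(EnrichFromPostgres())
        # 'enriched' would then feed the updates back to MySQL/Postgres, BigQuery, etc.
    )
```

One caveat I'm aware of: per-element queries against a relational database can become a bottleneck, so caching the reference data or loading it as a side input are common alternatives.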

I think I'm hearing that the event payload is going to be replicated or distributed across many tables. What is the nature of the back-end database? Is it BigQuery, Cloud SQL, or something else?

> TBD.

Great responses. I'm not sure if you are a Google Cloud customer today or not. Either way, I am getting the impression that you might benefit from a focused and detailed design discussion for your specific use case. A Google customer engineer (technical) would be able to sit down with you (in person or virtually), start gathering all your requirements, and work with you to flesh out some high-level architectures taking account of all the distinct needs. As for merging multiple streams of incoming events, especially if they need to be time-windowed together, that really starts to sound like Dataflow (Apache Beam). Not only can it group events into time windows for processing, it can also handle concepts such as late-arriving events (such as might occur if one of the feeds broke or stalled). While Apache Beam SQL is indeed a candidate in a potential solution, I'd likely lean towards starting with Beam itself (Java or Python as opposed to SQL). The API mechanisms are more mature and richer at this time.

Again, we seem to be in a high-level design discussion here, and I suggest that the public forum won't be nearly as productive as engaging with a Google customer engineer (use that phrase when talking with Google ... they'll know what you mean). You and the Google customer engineer can likely quickly identify which parts of your puzzle can be easily solved by Beam (and Google's managed version ... Dataflow) and which parts might be trickier ... and ... if trickier ... bring to bear Google's experience working with other clients who had similar issues. If the puzzles get even trickier, the customer engineer would be the way to bring in additional Google subject matter experts who are dedicated Beam specialists (if needed).
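As a small illustration of the windowing and late-data handling I mentioned (the durations here are made up), Beam lets you declare the window size, when early and late results fire, and how long late events are still accepted:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/orders")
        | "Parse" >> beam.Map(json.loads)
        | "KeyByRegion" >> beam.Map(lambda o: (o["region"], o["amount"]))
        | "Window" >> beam.WindowInto(
            FixedWindows(60),                              # 1-minute event-time windows
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(30),     # speculative result every 30 seconds
                late=trigger.AfterCount(1),                # re-fire whenever a late event shows up
            ),
            allowed_lateness=600,                          # keep windows open for 10 more minutes
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | "SumPerRegion" >> beam.CombinePerKey(sum)
    )
```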

Thanks, that makes sense.

Beam can also handle batch processing on Dataflow. It's powerful and probably overkill, but here is a tutorial that will walk you through it: Batch Processing with Beam Tutorial on Udemy