Do we really need Dataflow for stream processing?

This is somewhat of an open-ended question, but I am trying to understand whether we really need Dataflow or an equivalent technology like Apache Flink for stream processing. Our problem: we are a large-scale enterprise company, and we need to process sales orders in real time to compute various financial metrics. Our data volume is what I would call medium; we certainly don't receive billions or even millions of events per second. Our challenge is that the data model is screwed up: data is scattered across many tables and the information is imperfect. To process a sales order we have to look up many tables and apply complex rules to fill in the information we don't have. Is Dataflow really the right choice for this use case? These days, whenever I hear "stream processing," the next thing I hear is Dataflow, Flink, Kafka (KSQL), etc.
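To make the shape of the problem concrete, here is a deliberately simplified sketch of what processing a single order involves. The tables, field names, and rules are hypothetical stand-ins for our real (much messier) schema; the pattern of several lookups per order followed by rule-based backfilling is the accurate part.

```python
# Illustrative only: table contents, field names, and rules are made up.
# The point is the shape of the work: several lookups per order, then
# business rules to backfill the fields the source systems didn't send.

CUSTOMERS = {"C-1": {"default_currency": "EUR"}}    # stand-in for a customer table
PRICE_LIST = {"SKU-9": {"list_price": 100.0}}       # stand-in for a pricing table
CONTRACTS = {"K-7": {"discount": 0.15}}             # stand-in for a contracts table


def enrich_order(order: dict) -> dict:
    """Assemble a complete order record from several imperfect sources."""
    customer = CUSTOMERS.get(order.get("customer_id"), {})
    pricing = PRICE_LIST.get(order.get("sku"), {})
    contract = CONTRACTS.get(order.get("contract_id"), {})

    # Rule: fall back to the customer's default currency if the order has none.
    currency = order.get("currency") or customer.get("default_currency", "USD")

    # Rule: reconstruct the unit price from the price list and contract discount
    # when the order itself does not carry one.
    unit_price = order.get("unit_price")
    if unit_price is None and pricing:
        unit_price = pricing["list_price"] * (1 - contract.get("discount", 0.0))

    return {
        **order,
        "currency": currency,
        "net_amount": (unit_price or 0.0) * order.get("quantity", 0),
    }


print(enrich_order({"customer_id": "C-1", "sku": "SKU-9",
                    "contract_id": "K-7", "quantity": 3}))
```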

1 ACCEPTED SOLUTION

Great responses. I'm not sure if you are a Google Cloud customer today or not. Either way, I am getting the impression that you might benefit from a focused and detailed design discussion for your specific use case. A Google customer engineer (technical) would be able to sit down with you (in person or virtually), gather all your requirements, and work with you to flesh out some high-level architectures that take account of all the distinct needs. As for merging multiple streams of incoming events, especially if they need to be time-windowed together, that really starts to sound like Dataflow (Apache Beam). Not only can it group events into time windows for processing, it can also handle concepts such as late-arriving events (such as might occur if one of the feeds broke or stalled). While Apache Beam SQL is indeed a candidate in a potential solution, I'd likely lean towards starting with Beam itself (Java or Python as opposed to SQL). The API mechanisms are more mature and richer at this time.
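To give a feel for what that looks like in code, here is a minimal Apache Beam (Python) sketch, not a production pipeline: it assumes a Pub/Sub topic of JSON order events with made-up field names ("region", "amount"), and shows fixed one-minute windows plus an allowance for events that arrive up to ten minutes late.

```python
import json

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.trigger import (AccumulationMode, AfterProcessingTime,
                                            AfterWatermark)


def parse_order(message: bytes) -> tuple:
    """Decode a JSON order event and key it by region (field names are hypothetical)."""
    order = json.loads(message.decode("utf-8"))
    return order["region"], float(order["amount"])


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadOrders" >> beam.io.ReadFromPubSub(
                topic="projects/YOUR_PROJECT/topics/sales-orders")   # placeholder topic
            | "Parse" >> beam.Map(parse_order)
            | "Window" >> beam.WindowInto(
                window.FixedWindows(60),                     # 1-minute windows
                trigger=AfterWatermark(late=AfterProcessingTime(60)),
                allowed_lateness=600,                        # accept events up to 10 min late
                accumulation_mode=AccumulationMode.ACCUMULATING)
            | "SumPerRegion" >> beam.CombinePerKey(sum)      # e.g. revenue per region per window
            | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]:.2f}".encode("utf-8"))
            | "Publish" >> beam.io.WriteToPubSub(
                topic="projects/YOUR_PROJECT/topics/order-metrics")  # placeholder sink
        )


if __name__ == "__main__":
    run()
```

The same pipeline runs on Dataflow by selecting the Dataflow runner in the pipeline options; the windowing, trigger, and lateness settings are where most of the tuning for a use case like yours would happen.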

Again, we seem to be in a high-level design discussion here, and I suggest that the public forum won't be nearly as productive as engaging with a Google customer engineer (use that phrase when talking with Google ... they'll know what you mean). The Google customer engineer and yourself can likely quickly identify which parts of your puzzle can be easily solved by Beam (and Google's managed version, Dataflow) and which parts might be trickier ... and, if trickier, bring to bear Google's experience working with other clients who had similar issues. If the puzzles get even trickier, the customer engineer would also be the way to bring in additional Google subject matter experts who are dedicated Beam specialists (if needed).

