What and where does the Dataflow service backend reside in Streaming Engine? Is this a VM?

bhaveshg · 03-07-2023 03:29 AM

I have a Dataflow streaming job and I am thinking of enabling the Streaming Engine option for it. I am reading from Pub/Sub and writing to BigQuery. I know that by default the Dataflow pipeline runner executes the steps of a streaming pipeline entirely on worker virtual machines, consuming worker CPU, memory, and Persistent Disk storage, and that Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend. Can you help me understand where this Dataflow service backend resides and what exactly it does? Is it a separate, individual VM allocated to perform windowing operations?

Joevanie

Dataflow’s Streaming Engine moves pipeline execution out of the worker VMs and into a backend service that is managed by Google Cloud. This backend service handles windowing operations, data shuffling, and state storage more efficiently than worker VMs can. It also allows for more responsive autoscaling. The Streaming Engine backend is not a separate individual VM, but a part of the Dataflow service itself. Aside from the official docs on using Streaming Engine, there is a great article about it.
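
For reference, here is a minimal sketch of how Streaming Engine might be switched on for a Python pipeline. It only builds the command-line flags that would be passed to Apache Beam's `PipelineOptions`. The helper name `streaming_engine_args` and the project, region, and subnetwork values are hypothetical placeholders, not something taken from this thread. Note that the `--subnetwork` flag controls where the worker VMs are launched.

```python
from typing import List, Optional


def streaming_engine_args(project: str, region: str,
                          subnetwork: Optional[str] = None) -> List[str]:
    """Build Dataflow flags for a streaming job with Streaming Engine enabled.

    Hypothetical helper for illustration; the flags themselves are standard
    Apache Beam Python SDK pipeline options.
    """
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        "--streaming",                 # run as a streaming job
        "--enable_streaming_engine",   # offload state and shuffle to the service backend
    ]
    if subnetwork:
        # Subnetwork in which the worker VMs are created.
        args.append(f"--subnetwork={subnetwork}")
    return args


# Placeholder values; replace with your own project, region, and subnetwork.
example_args = streaming_engine_args(
    "my-project",
    "us-central1",
    subnetwork="regions/us-central1/subnetworks/my-subnet",
)
```

The resulting list could then be handed to `apache_beam.options.pipeline_options.PipelineOptions(example_args)` when constructing the pipeline, assuming the Beam SDK with the GCP extras is installed.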

bhaveshg

Hi @Joevanie, thanks for your reply. I know what advantages Streaming Engine brings. My only concern is whether the Dataflow service backend lies within the organization's VPC/subnet or is a separate entity outside the VPC/subnet.

Sathish_tu_ns

I enabled Streaming Engine and I still see worker VM instances being created.
If the pipeline is moved out of the worker VMs to the Dataflow service backend, why do we still need VM instances?

Thanks in advance.
