
Why does Cloud DataFlow require a VPC if it is Serverless

I'd like to know why Cloud DataFlow requires a VPC given that it is serverless. And how is the VPC used, since there is no explicit assignment of IP addresses?


Your default VPC (and any new ones you create) is a logically isolated section of GCP reserved for you. All of your resources, even serverless ones, are allocated inside that VPC.

Even serverless resources use VMs internally, which are provisioned automatically. Those resources need at least a private IP address.

Hence a VPC is needed even for serverless services.

Thank you for your response. Since there is no explicit assignment of IP addresses from the defined subnets (unlike in the case of Compute Engine VMs), is the IP address management and allocation completely transparent and managed by Google?

Also note that I am able to use Pub/Sub, Cloud Functions ... without creating a VPC, and even without the default VPC in the project.

There are a couple of things to consider here. First, 'serverless' as a term refers to any service where you, as the consumer of the service, no longer need to worry about 'servers' in any traditional sense: you don't need to think about or perform deployment, setup, scaling, patching, etc. of any servers. However, underneath, all 'serverless' services do at some point have some compute capability supporting them, i.e. CPU, RAM and storage, which of course typically means a server; the key thing is that you just don't need to worry about it.

With that in mind, how that underlying compute is provisioned and made available to the serverless service varies from service to service, and even within a service. Many run as a platform outside your own VPC, but some can also leverage your VPC to provide more flexibility when it comes to data privacy and integration. In the case of DataFlow, the pipeline (Apache Beam) management and engine run outside, but when you run a batch job, the 'workers' that are deployed to execute the job are actually deployed as ephemeral (temporary) instances within your VPC. While it is deploying servers to handle processing, you don't really need to worry about them individually at all. There are some settings around which network to use, should you need to control that, and what IP addressing to use (public vs private), but other than that the service takes care of the entire lifecycle of the workers and scales them as needed - hence being classed as 'serverless'.
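If you do want to control which network the workers land in, the usual place is the pipeline's worker options. Here's a minimal sketch using the Apache Beam Python SDK with the Dataflow runner; the project, bucket and subnetwork names are placeholders I've made up, and flags like --subnetwork and --no_use_public_ips are worth double-checking against the current Dataflow documentation for your SDK version:

```python
# Minimal sketch (not an official example): submitting a Beam pipeline to
# Dataflow while pinning the ephemeral workers to a specific subnetwork
# and private IPs only. Project, bucket and subnet names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
    # Which VPC subnetwork the worker VMs should be attached to.
    # If omitted, Dataflow falls back to the network named 'default'.
    "--subnetwork=regions/us-central1/subnetworks/my-subnet",
    # Give the workers internal IPs only (the subnet then needs something
    # like Private Google Access so they can still reach Google APIs).
    "--no_use_public_ips",
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.Create(["hello", "dataflow"])
     | beam.Map(print))
```

Everything else about those worker VMs (creation, IP allocation from the subnet, scaling, teardown) is handled by the service.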

Hope that helps, any other questions just ask.