Using custom VM instance images for worker instances in Dataflow

Dataflow currently supports custom containers, as described on the page below:
https://cloud.google.com/dataflow/docs/guides/using-custom-containers

I want to know whether we can use our own VM image for starting the worker VM instances. Currently, the Dataflow service account pulls the Compute Engine image from the "dataflow-service-producer-prod" project.

However, we are restricted to using only images prepared within our org that include the patches we need.

Any help is much appreciated.


3 REPLIES

Yes, Dataflow supports the use of custom containers. However, Dataflow does not support the use of custom VM images for worker instances on Compute Engine.
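
For reference, here is a minimal sketch of launching a Beam Python pipeline on Dataflow with a custom SDK container image. The project ID, region, bucket, and image path are placeholders, and the image would need to be built from an Apache Beam SDK base image and pushed to a registry beforehand.

# Minimal sketch; project, bucket, and image path below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/temp",   # placeholder staging bucket
    # The custom container only replaces the SDK harness container running on the
    # worker VM; it does not change the worker VM image itself.
    sdk_container_image="us-central1-docker.pkg.dev/my-project/my-repo/beam-worker:latest",
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline | beam.Create(["hello", "world"]) | beam.Map(print))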

Hello, thanks for the clarification.

Does the same apply to Dataproc (Spark Streaming) as well? Based on the link below, I understand that we can apply customization on top of an existing image using https://github.com/GoogleCloudDataproc/custom-images/blob/master/generate_custom_image.py, but again we can't use our own OS image for the worker VM instances in Compute Engine.

https://cloud.google.com/dataproc/docs/guides/dataproc-images#generate_a_custom_image

Also, can we apply customization to Dataflow VM images, similar to what Dataproc provides as mentioned above?

Thanks again for your time.

In terms of Dataproc, you can indeed create custom images. The process involves using the generate_custom_image.py Python program, which creates a temporary Compute Engine VM instance with the specified Dataproc base image. The program then runs a customization script inside the VM instance to install custom packages and/or update configurations. After the customization script completes its task, the program shuts down the VM instance and creates a Dataproc custom image from the disk of the VM instance.
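
For illustration, here is a rough sketch of driving that build from Python, assuming the flags documented for generate_custom_image.py. The image name, Dataproc version, customization script, zone, and bucket are all placeholders.

# Rough sketch; all names below are placeholders and the flags assume the
# documented generate_custom_image.py interface.
import subprocess

subprocess.run(
    [
        "python", "generate_custom_image.py",
        "--image-name", "my-org-dataproc-image",         # name of the resulting custom image
        "--dataproc-version", "2.1-debian11",            # Dataproc base image to start from
        "--customization-script", "install_patches.sh",  # script run inside the temporary VM
        "--zone", "us-central1-a",                       # zone for the temporary build VM
        "--gcs-bucket", "gs://my-bucket",                # bucket for the build logs
    ],
    check=True,
)

The resulting image can then be referenced when creating a Dataproc cluster via its image flag.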

However, Dataflow does not currently support the use of custom VM images. While Dataflow does support custom container images, these apply only to the Docker containers running inside the worker VM instances, not to the worker VM instances themselves.
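
So for your requirement, the org-specific patches would have to be baked into the container image rather than into a VM image. A rough sketch of building such an image, assuming a recent Apache Beam SDK base image; the base image tag, internal package, and registry path are placeholders:

# Rough sketch of building and pushing a patched worker container; the base image
# tag, internal package, and registry path are placeholders, not verified values.
import os
import subprocess
import tempfile

DOCKERFILE = """\
FROM apache/beam_python3.11_sdk:2.55.0
RUN pip install --no-cache-dir my-org-internal-lib==1.0.0
"""

image = "us-central1-docker.pkg.dev/my-project/my-repo/beam-worker:latest"

with tempfile.TemporaryDirectory() as ctx:
    with open(os.path.join(ctx, "Dockerfile"), "w") as f:
        f.write(DOCKERFILE)
    subprocess.run(["docker", "build", "-t", image, ctx], check=True)
    subprocess.run(["docker", "push", image], check=True)

That image is what you would pass as the SDK container image when launching the pipeline; OS-level patching of the worker VM itself is not something Dataflow exposes.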