
Share the DB connection pool across all Dataflow workers

We have 50 Google Cloud SQL instances, each running PostgreSQL. Each of these instances contains 100 individual databases.

Each Cloud SQL instance has a maximum connection limit of 1000 connections.

We are using Dataflow with a worker limit of 1000, and each Dataflow worker is configured with a connection pool using HikariCP. Specifically, each worker has a connection pool of size 10 for every database within the 50 Cloud SQL instances.

When Dataflow scales up, each new worker creates its own connection pools for the same databases, adding further connections to the Cloud SQL instances. At the stated limits, that is 10 connections × 100 databases = 1,000 connections per worker per instance, so even a second worker can push an instance past its 1,000-connection limit, and 1,000 workers imply up to 1,000,000 potential connections per instance.



I have the following queries:

1) Can we share this connection pool along with its state across all the workers of Dataflow?

2) Is there any option in Dataflow to share objects across workers?

3) Can we use side inputs to share objects between workers?

1 REPLY

  1. No, it is not possible to share a connection pool, along with its state, across the workers of Dataflow. Dataflow workers run in separate virtual machines (VMs), each with its own JVM, so they do not share memory or state; every worker therefore ends up with its own independent connection pool. A common mitigation is to share a single pool per worker JVM instead; a sketch appears at the end of this reply.

  2. Yes, Dataflow has options for sharing read-only data across workers, the main one being side inputs. A side input distributes read-only data to the workers on a per-bundle basis; for example, you could create a side input containing a list of database connection strings.

  3. While you can use side inputs to provide data to each worker, they are not suitable for sharing mutable objects or state such as a connection pool. Side inputs are intended for small datasets that you want to join with a larger stream of data; they are read-only and cannot be updated by the workers. A sketch of this pattern follows the list.
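For illustration, here is a minimal Beam (Java) sketch of distributing read-only JDBC connection strings via a side input. The pipeline, transform names, element values, and URLs are placeholders for this example, not your actual configuration:

```java
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Materialize a read-only list of JDBC URLs as a side input.
    // The URLs below are placeholders.
    PCollectionView<List<String>> jdbcUrlsView =
        p.apply("ConnStrings", Create.of(
            "jdbc:postgresql://10.0.0.1:5432/db_001",
            "jdbc:postgresql://10.0.0.1:5432/db_002"))
         .apply(View.asList());

    // Placeholder main input.
    PCollection<String> events = p.apply("Events", Create.of("e1", "e2"));

    events.apply("UseConnStrings", ParDo.of(new DoFn<String, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Every worker sees the same list; it cannot be mutated here.
            List<String> jdbcUrls = c.sideInput(jdbcUrlsView);
            // ... choose a URL and obtain a connection from a per-worker pool ...
          }
        }).withSideInputs(jdbcUrlsView));

    p.run().waitUntilFinish();
  }
}
```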

In general, it is best to avoid sharing mutable objects across workers in Dataflow, since doing so invites race conditions and other concurrency problems. If you need to share read-only data between workers, a side input is generally the better choice.
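As for the connection-count problem itself, a common mitigation (a minimal sketch under assumptions, not a drop-in fix for your setup) is to create one HikariCP pool per worker JVM lazily via a static field, so that all DoFn instances and threads on the same worker reuse it. The class name, JDBC URL, credentials, and pool size below are hypothetical placeholders:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical DoFn: all threads on one worker JVM share a single pool.
public class WriteToPostgresFn extends DoFn<String, Void> {

  // Static fields are per-JVM (per-worker) and are not serialized with the
  // DoFn, so every DoFn instance on the same worker reuses the same pool.
  private static HikariDataSource dataSource;

  private static synchronized HikariDataSource getDataSource() {
    if (dataSource == null) {
      HikariConfig config = new HikariConfig();
      config.setJdbcUrl("jdbc:postgresql://10.0.0.1:5432/db_001"); // placeholder
      config.setUsername("app_user");                              // placeholder
      config.setPassword("secret");                                // placeholder
      // Keep this small: total connections grow with pool size x workers x databases.
      config.setMaximumPoolSize(2);
      dataSource = new HikariDataSource(config);
    }
    return dataSource;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    try (Connection conn = getDataSource().getConnection()) {
      // ... execute statements for c.element() against conn ...
    }
  }
}
```

With this pattern the total connection count is bounded by maximumPoolSize × number of workers × number of databases each worker actually touches, which you can size against the 1,000-connection limit of each Cloud SQL instance.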