
Apache Beam pipeline deployed with a flex template fails to run on Dataflow

I'm trying to build a Dataflow pipeline to fetch data from SQL Server (it's on a VM on Compute Engine), do some processing, and push the results to BigQuery.

I decided to use a Flex Template for this, and I've been having the hardest time reading from SQL Server with the ReadFromJdbc transform when I run the job on Dataflow. The pipeline runs successfully on my machine but fails with a cryptic message when it runs on Dataflow.
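For context, the read and write portion of the pipeline is roughly the following (simplified; the connection details, table, and dataset names are placeholders):

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions()  # populated from the flex template parameters
    with beam.Pipeline(options=options) as p:
        (
            p
            | 'ReadFromSqlServer' >> ReadFromJdbc(
                table_name='my_table',  # placeholder
                driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
                jdbc_url='jdbc:sqlserver://<vm-ip-address>:1433;databaseName=<your_database>',
                username='<user>',
                password='<password>')
            # ReadFromJdbc yields NamedTuple rows; convert them to dicts for BigQuery
            | 'ToDict' >> beam.Map(lambda row: row._asdict())
            | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                'my-project:my_dataset.results',  # placeholder; table assumed to exist
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

if __name__ == '__main__':
    run()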

Here is the error: 

INFO 2025-03-12T18:15:10.717630Z INFO:apache_beam.runners.dataflow.dataflow_runner:Pipeline has additional dependencies to be installed in SDK worker container, consider using the SDK container image pre-building workflow to avoid repetitive installations. Learn more on https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild
INFO 2025-03-12T18:15:10.747540Z Traceback (most recent call last):
INFO 2025-03-12T18:15:10.747697Z File "/template/main2.py", line 85, in <module>
INFO 2025-03-12T18:15:10.747907Z run()
INFO 2025-03-12T18:15:10.747972Z File "/template/main2.py", line 29, in run
INFO 2025-03-12T18:15:10.748104Z p.run()
INFO 2025-03-12T18:15:10.748152Z File "/usr/local/lib/python3.10/site-packages/apache_beam/pipeline.py", line 618, in run
INFO 2025-03-12T18:15:10.748422Z return self.runner.run_pipeline(self, self._options)
INFO 2025-03-12T18:15:10.748474Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 502, in run_pipeline
INFO 2025-03-12T18:15:10.748731Z self.dataflow_client.create_job(self.job), self)
INFO 2025-03-12T18:15:10.748786Z File "/usr/local/lib/python3.10/site-packages/apache_beam/utils/retry.py", line 298, in wrapper
INFO 2025-03-12T18:15:10.748970Z return fun(*args, **kwargs)
INFO 2025-03-12T18:15:10.749017Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 732, in create_job
INFO 2025-03-12T18:15:10.749286Z self.create_job_description(job)
INFO 2025-03-12T18:15:10.749334Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 824, in create_job_description
INFO 2025-03-12T18:15:10.749633Z resources = self._stage_resources(job.proto_pipeline, job.options)
INFO 2025-03-12T18:15:10.749685Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 643, in _stage_resources
INFO 2025-03-12T18:15:10.749929Z staged_resources = resource_stager.stage_job_resources(
INFO 2025-03-12T18:15:10.749983Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/portability/stager.py", line 413, in stage_job_resources
INFO 2025-03-12T18:15:10.750201Z self.stage_artifact(
INFO 2025-03-12T18:15:10.750250Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 1107, in stage_artifact
INFO 2025-03-12T18:15:10.750610Z self._dataflow_application_client._gcs_file_copy(
INFO 2025-03-12T18:15:10.750668Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 545, in _gcs_file_copy
INFO 2025-03-12T18:15:10.750888Z self._uncached_gcs_file_copy(from_path, to_path)
INFO 2025-03-12T18:15:10.750935Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 563, in _uncached_gcs_file_copy
INFO 2025-03-12T18:15:10.751155Z total_size = os.path.getsize(from_path)
INFO 2025-03-12T18:15:10.751203Z File "/usr/local/lib/python3.10/genericpath.py", line 50, in getsize
INFO 2025-03-12T18:15:10.765426Z return os.stat(filename).st_size
INFO 2025-03-12T18:15:10.765536Z FileNotFoundError: [Errno 2] No such file or directory: '/tmp/beam-pipeline-tempmg8_pdsm/tmpd6oe_67j'
INFO 2025-03-12T18:15:11.136336Z python failed with exit status 1
ERROR 2025-03-12T18:15:11.136446Z Error: Template launch failed: exit status 1

I suspect there is an issue with the JDBC driver. I tried adding the mssql-jdbc-9.4.1.jre11.jar driver to the classpath in my Docker image, but this didn't change anything. How can I resolve this?

Thanks

1 REPLY

A pipeline that runs locally but fails on Dataflow usually comes down to differences between the two execution environments, especially around external dependencies such as JDBC drivers. Your local machine typically has the driver on its classpath, but Dataflow workers run in isolated Docker containers, so every required dependency has to be packaged explicitly.

Although the initial FileNotFoundError encountered during resource staging is somewhat unusual for JDBC driver-related issues, it likely indicates a broader dependency management challenge. Dataflow Flex Templates, which depend on Docker images, necessitate careful inclusion of dependencies. Specifically, JDBC drivers must be placed within the /opt/apache/beam/jars/ directory inside the Docker image to ensure they're accessible by Dataflow workers.
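If your Beam version supports it (recent releases do), you can also point ReadFromJdbc at the driver explicitly through its classpath argument, which accepts Maven coordinates (and, depending on the version, local jar paths). A minimal sketch with illustrative values:

from apache_beam.io.jdbc import ReadFromJdbc

read_from_sql_server = ReadFromJdbc(
    table_name='my_table',  # placeholder
    driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
    jdbc_url='jdbc:sqlserver://<vm-ip-address>:1433;databaseName=<your_database>',
    username='<user>',
    password='<password>',
    # Maven coordinate for the driver; Beam stages it for the JDBC expansion.
    # A path such as '/opt/apache/beam/jars/mssql-jdbc-9.4.1.jre11.jar' may also
    # work here, depending on the Beam version.
    classpath=['com.microsoft.sqlserver:mssql-jdbc:9.4.1.jre11'])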

To resolve these issues, consider the following detailed steps:

  • Explicit Driver Placement:

    • Ensure the JDBC driver (e.g., mssql-jdbc-9.4.1.jre11.jar) is correctly copied into /opt/apache/beam/jars/ within your Dockerfile.

  • Docker Image Rebuild:

    • After updating the Dockerfile, rebuild and push the Docker image to your container registry.

  • Flex Template Creation:

    • Accurately specify the Docker image and provide correctly structured metadata when creating the Flex Template.

  • Validate JDBC URL:

    • Verify that the JDBC connection string adheres to the standard SQL Server format (see the connectivity snippet after this list):

      jdbc:sqlserver://<vm-ip-address>:1433;databaseName=<your_database>

  • GCS Bucket Access:

    • Confirm that the Dataflow service account has the necessary permissions for reading from and writing to the staging and temporary GCS buckets.

  • Network and Firewall Configuration:

    • Ensure that port 1433 is open on your SQL Server VM and that Dataflow workers have appropriate network access, possibly by adjusting firewall rules or using Private Service Connect; a quick reachability check is sketched below.
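
To sanity-check the last two points, here is a small snippet (placeholders for host and database; run it from a VM on the same VPC/subnet the Dataflow workers will use) that builds the URL in the format above and confirms port 1433 is reachable:

import socket

SQL_HOST = '<vm-ip-address>'   # placeholder: the SQL Server VM's reachable IP
SQL_PORT = 1433
DATABASE = '<your_database>'   # placeholder

# Standard SQL Server JDBC URL format expected by ReadFromJdbc.
jdbc_url = f'jdbc:sqlserver://{SQL_HOST}:{SQL_PORT};databaseName={DATABASE}'
print('JDBC URL:', jdbc_url)

# Quick reachability check for port 1433 from the workers' network.
try:
    with socket.create_connection((SQL_HOST, SQL_PORT), timeout=5):
        print(f'{SQL_HOST}:{SQL_PORT} is reachable')
except OSError as exc:
    print(f'Cannot reach {SQL_HOST}:{SQL_PORT}: {exc}')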