I'm trying to build a Dataflow pipeline that fetches data from SQL Server (it runs on a VM on Compute Engine), does some processing, and pushes the results to BigQuery.
I decided to use a flex template for this, and I've been having the hardest time reading from SQL Server with the ReadFromJdbc transform when I run the job on Dataflow. The pipeline runs successfully on my machine but fails with a cryptic message when it runs on Dataflow. A stripped-down sketch of the pipeline follows.
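Here is roughly what the pipeline looks like (host, credentials, table names, and schema below are placeholders, not my real values):

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Options come from the flex template launch; nothing special here.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            # Read from SQL Server over JDBC (connection details are
            # placeholders, not my real values).
            | 'ReadFromSqlServer' >> ReadFromJdbc(
                table_name='my_table',
                driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
                jdbc_url='jdbc:sqlserver://10.0.0.2:1433;databaseName=mydb',
                username='my_user',
                password='my_password',
            )
            # "Some processing" - simplified to a dict conversion here.
            | 'ToDict' >> beam.Map(lambda row: row._asdict())
            # Push results to BigQuery (table and schema are placeholders).
            | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                'my_project:my_dataset.my_results',
                schema='col_a:STRING,col_b:INTEGER',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == '__main__':
    run()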
Here is the error:
INFO 2025-03-12T18:15:10.717630Z INFO:apache_beam.runners.dataflow.dataflow_runner:Pipeline has additional dependencies to be installed in SDK worker container, consider using the SDK container image pre-building workflow to avoid repetitive installations. Learn more on https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild
INFO 2025-03-12T18:15:10.747540Z Traceback (most recent call last):
INFO 2025-03-12T18:15:10.747697Z File "/template/main2.py", line 85, in <module>
INFO 2025-03-12T18:15:10.747907Z run()
INFO 2025-03-12T18:15:10.747972Z File "/template/main2.py", line 29, in run
INFO 2025-03-12T18:15:10.748104Z p.run()
INFO 2025-03-12T18:15:10.748152Z File "/usr/local/lib/python3.10/site-packages/apache_beam/pipeline.py", line 618, in run
INFO 2025-03-12T18:15:10.748422Z return self.runner.run_pipeline(self, self._options)
INFO 2025-03-12T18:15:10.748474Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 502, in run_pipeline
INFO 2025-03-12T18:15:10.748731Z self.dataflow_client.create_job(self.job), self)
INFO 2025-03-12T18:15:10.748786Z File "/usr/local/lib/python3.10/site-packages/apache_beam/utils/retry.py", line 298, in wrapper
INFO 2025-03-12T18:15:10.748970Z return fun(*args, **kwargs)
INFO 2025-03-12T18:15:10.749017Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 732, in create_job
INFO 2025-03-12T18:15:10.749286Z self.create_job_description(job)
INFO 2025-03-12T18:15:10.749334Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 824, in create_job_description
INFO 2025-03-12T18:15:10.749633Z resources = self._stage_resources(job.proto_pipeline, job.options)
INFO 2025-03-12T18:15:10.749685Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 643, in _stage_resources
INFO 2025-03-12T18:15:10.749929Z staged_resources = resource_stager.stage_job_resources(
INFO 2025-03-12T18:15:10.749983Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/portability/stager.py", line 413, in stage_job_resources
INFO 2025-03-12T18:15:10.750201Z self.stage_artifact(
INFO 2025-03-12T18:15:10.750250Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 1107, in stage_artifact
INFO 2025-03-12T18:15:10.750610Z self._dataflow_application_client._gcs_file_copy(
INFO 2025-03-12T18:15:10.750668Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 545, in _gcs_file_copy
INFO 2025-03-12T18:15:10.750888Z self._uncached_gcs_file_copy(from_path, to_path)
INFO 2025-03-12T18:15:10.750935Z File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 563, in _uncached_gcs_file_copy
INFO 2025-03-12T18:15:10.751155Z total_size = os.path.getsize(from_path)
INFO 2025-03-12T18:15:10.751203Z File "/usr/local/lib/python3.10/genericpath.py", line 50, in getsize
INFO 2025-03-12T18:15:10.765426Z return os.stat(filename).st_size
INFO 2025-03-12T18:15:10.765536Z FileNotFoundError: [Errno 2] No such file or directory: '/tmp/beam-pipeline-tempmg8_pdsm/tmpd6oe_67j'
INFO 2025-03-12T18:15:11.136336Z python failed with exit status 1
ERROR 2025-03-12T18:15:11.136446Z Error: Template launch failed: exit status 1
I suspect there is an issue with the JDBC driver. I tried adding the mssql-jdbc-9.4.1.jre11.jar driver to the classpath in my Docker image, but that didn't change anything. How can I resolve this?
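For reference, my understanding is that the driver can also be supplied through ReadFromJdbc's classpath argument instead of (or in addition to) baking it into the image; this sketch is what I mean (the jar path assumes where my Dockerfile copies the driver, and the connection details are placeholders):

from apache_beam.io.jdbc import ReadFromJdbc

# classpath accepts local jar paths or Maven coordinates; the path
# below is an assumption based on where my Dockerfile puts the jar.
read = ReadFromJdbc(
    table_name='my_table',
    driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
    jdbc_url='jdbc:sqlserver://10.0.0.2:1433;databaseName=mydb',
    username='my_user',
    password='my_password',
    classpath=['/template/jars/mssql-jdbc-9.4.1.jre11.jar'],
    # or, as a Maven coordinate instead of a local jar:
    # classpath=['com.microsoft.sqlserver:mssql-jdbc:9.4.1.jre11'],
)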
Thanks