I'm working on a personal data migration project where I need to transfer multiple tables from a Microsoft SQL Server AdventureWorks2019 database to Google BigQuery using the Dataflow "SQL Server to BigQuery" (JDBC to BigQuery) template. The goal is to use that template, not Cloud Shell or Python. My pipeline fails with the following error:
"""
2024-06-29 22:12:13.949 CST Error message from worker: org.apache.beam.sdk.util.UserCodeException: java.sql.SQLException: Cannot create PoolableConnectionFactory (The TCP/IP connection to the host xx.xx.xx.xxx, port 1433 has failed. Error: "connect timed out. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.".) org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39) org.apache.beam.sdk.io.jdbc.JdbcIO$ReadFn$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:803) org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:348) org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:275) org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1792) org.apache.beam.fn.harness.FnApiDoFnRunner.access$3000(FnApiDoFnRunner.java:143) org.apache.beam.fn.harness.FnApiDoFnRunner$NonWindowObservingProcessBundleContext.output(FnApiDoFnRunner.java:2650) org.apache.beam.sdk.transforms.MapElements$2.processElement(MapElements.java:151) org.apache.beam.sdk.transforms.MapElements$2$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:803) org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:348) org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:275) org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:213) 
org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.multiplexElements(BeamFnDataInboundObserver.java:158) org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:537) org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:150) org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:115) java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:163) java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.sql.SQLException: Cannot create PoolableConnectionFactory (The TCP/IP connection to the host xx.xx.xx.xxx, port 1433 has failed. Error: "connect timed out. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.".) org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:653) org.apache.commons.dbcp2.BasicDataSource.createDataSource(BasicDataSource.java:531) org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:731) org.apache.beam.sdk.io.jdbc.JdbcIO$ReadFn.getConnection(JdbcIO.java:1526) org.apache.beam.sdk.io.jdbc.JdbcIO$ReadFn.processElement(JdbcIO.java:1539) Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host xx.xx.xx.xxx, port 1433 has failed. Error: "connect timed out. Verify the connection properties. 
Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.". com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:237) com.microsoft.sqlserver.jdbc.SQLServerException.convertConnectExceptionToSQLServerException(SQLServerException.java:288) com.microsoft.sqlserver.jdbc.SocketFinder.findSocket(IOBuffer.java:2574) com.microsoft.sqlserver.jdbc.TDSChannel.open(IOBuffer.java:715) com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:3427) com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:3077) com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:2919) com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1787) com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:1229) org.apache.commons.dbcp2.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:52) org.apache.commons.dbcp2.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:374) org.apache.commons.dbcp2.BasicDataSource.validateConnectionFactory(BasicDataSource.java:106) org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:649) org.apache.commons.dbcp2.BasicDataSource.createDataSource(BasicDataSource.java:531) org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:731) org.apache.beam.sdk.io.jdbc.JdbcIO$ReadFn.getConnection(JdbcIO.java:1526) org.apache.beam.sdk.io.jdbc.JdbcIO$ReadFn.processElement(JdbcIO.java:1539) org.apache.beam.sdk.io.jdbc.JdbcIO$ReadFn$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:803) 
org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:348) org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:275) org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1792) org.apache.beam.fn.harness.FnApiDoFnRunner.access$3000(FnApiDoFnRunner.java:143) org.apache.beam.fn.harness.FnApiDoFnRunner$NonWindowObservingProcessBundleContext.output(FnApiDoFnRunner.java:2650) org.apache.beam.sdk.transforms.MapElements$2.processElement(MapElements.java:151) org.apache.beam.sdk.transforms.MapElements$2$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:803) org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:348) org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:275) org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:213) org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.multiplexElements(BeamFnDataInboundObserver.java:158) org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:537) org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:150) org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:115) java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:163) java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) java.base/java.lang.Thread.run(Thread.java:829)
"""
My connection URL:
"""
jdbc:sqlserver://xx.xx.xx.xxx:1433;databaseName=AdventureWorks2019;user=xxxxxxx;password=xxxxxxxx;
"""
Steps I have taken (job flow):
The job transfers data from the AdventureWorks2019 database on a local SQL Server instance (my Windows computer) to Google BigQuery:
1. SQL Server configuration: configure SQL Server for remote connections and make sure firewall rules allow external access.
2. Dataflow job creation: from my Mac, create and configure the Dataflow job, which reads from SQL Server via JDBC.
3. Data transfer: the job writes data to a Google Cloud Storage (GCS) bucket and then loads it into BigQuery.
When migrating data from SQL Server to BigQuery with Dataflow, errors such as "Cannot create PoolableConnectionFactory" and "PERMISSION_DENIED" can be frustrating, and understanding their root cause is the key to resolving them.
The "Cannot create PoolableConnectionFactory" error in your log wraps a TCP connect timeout: the Dataflow workers, which run on Google Cloud, cannot reach your SQL Server at all. That points to a network, NAT, or firewall problem rather than a credentials problem. A "PERMISSION_DENIED" error, by contrast, usually means the Dataflow service account lacks the permissions it needs on the BigQuery destination or the staging bucket.
To resolve these errors, it is important to address both network and firewall configurations as well as ensure the appropriate permissions are in place.
1. Network and Firewall Configuration
SQL Server Configuration: Start by confirming in SQL Server Configuration Manager that the TCP/IP protocol is enabled and that the instance is listening on a fixed port (1433 by default). The SQL Server Browser service is only needed for named instances on dynamic ports. Also check which IP address the instance is bound to: if it listens only on your computer's internal (private) IP, traffic from Google Cloud must still reach it through your router's public IP.
Windows Firewall: Create an inbound rule in Windows Firewall to specifically allow incoming connections to port 1433 (TCP). If you are using a third-party firewall, configure it similarly to allow this traffic.
Cloud Firewall (If Applicable): If your SQL Server is behind a cloud-based firewall, ensure that the necessary port is opened for incoming traffic from the IP ranges used by Google Cloud Dataflow. This step is crucial to allow Dataflow to communicate with your SQL Server.
Network Connectivity: Confirm that there is an actual network path from the Dataflow workers to your SQL Server. If the server sits on a home network behind NAT, you will need to forward port 1433 on your router to the machine's private IP and use the router's public IP in the JDBC URL. Keep in mind that exposing SQL Server directly to the internet is risky; for anything beyond a short-lived test, consider a VPN or Cloud VPN tunnel instead.
2. Permissions
SQL Server Permissions: Ensure that the SQL Server user specified in the JDBC connection string has sufficient permissions to read from the tables you want to migrate. It might be beneficial to create a dedicated user for the migration with read-only access to minimize security risks.
BigQuery Permissions: Verify that the Dataflow worker service account has the roles the job needs — typically Dataflow Worker, BigQuery Data Editor on the destination dataset, and Storage Object Admin on the staging bucket if you use GCS as an intermediate step. Broader roles such as BigQuery Admin also work, but grant more than the migration requires.
Service Account Key: If you are using a service account key file for authentication, make sure it is correctly specified in the Dataflow job parameters. This will ensure that Dataflow can authenticate properly when accessing both SQL Server and BigQuery.
3. Additional Considerations
SQL Server and Driver Compatibility: Check that your SQL Server version is supported by the JDBC driver JAR you supply to the template, and that the driver itself is reasonably current. An outdated or mismatched driver can cause connection errors, although a plain connect timeout like the one above is almost always a network-level problem rather than a driver-level one.
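A quick way to confirm the driver class is actually available, and which version you have, is to try loading it on a machine where the driver JAR is on the classpath. This is a minimal sketch; the class name is the standard Microsoft driver class, and the version is read from the JAR's package metadata, which may be absent depending on how the JAR was built:

```java
public class DriverCheck {
    public static void main(String[] args) {
        try {
            // Attempt to load the Microsoft JDBC driver from the classpath.
            Class<?> driver = Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
            // Package metadata usually carries the implementation version.
            Package pkg = driver.getPackage();
            String version = (pkg != null) ? pkg.getImplementationVersion() : "unknown";
            System.out.println("Driver loaded, version: " + version);
        } catch (ClassNotFoundException e) {
            System.out.println("Driver not on classpath: " + e.getMessage());
        }
    }
}
```

If this prints "Driver not on classpath", the JAR you are handing to the Dataflow template is likely missing or misnamed.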
Debugging Tips
Detailed Logging: Enable verbose logging in Dataflow to gain more insights into the connection errors. Detailed logs can help pinpoint the exact cause of the connection issues and provide clues for troubleshooting.
Connectivity Test: If you are unable to resolve the issue, try connecting to your SQL Server from a Google Compute Engine VM instance. This test can help determine if the issue is with your local network setup or if it is specific to the Dataflow configuration.
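That connectivity test amounts to opening a raw TCP socket to the server's IP and port, before JDBC is involved at all. Here is a minimal probe you could run from a Compute Engine VM (or anywhere along the path); the default host below is a placeholder from the TEST-NET documentation range, so substitute your own address:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class TcpProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    public static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers timeouts, refused connections, and unreachable networks alike.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "203.0.113.10"; // placeholder IP
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 1433;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 3000));
    }
}
```

If this probe fails from a GCE VM but succeeds from a machine on your home network, the problem is your router/NAT or firewall, not Dataflow.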
If you are setting up the JDBC connection programmatically within the Dataflow template, double-check the parameters to ensure accuracy. Below is an example of how you might configure the JDBC connection in your Dataflow template:
JdbcIO.<TableRow>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
        .create("com.microsoft.sqlserver.jdbc.SQLServerDriver", connectionUrl)
        .withUsername(username)
        .withPassword(password))
    .withQuery("SELECT * FROM Person.Person")  // table to migrate
    .withRowMapper(rowMapper)                  // maps each ResultSet row to a TableRow
    .withCoder(coder)
// ... (rest of your pipeline)
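One more detail worth checking on the URL itself: Microsoft's JDBC driver defaults to encrypt=true as of version 10.2, so against a SQL Server instance with no TLS certificate configured the connection can fail even when the network path is fine. If that applies to your setup, the encryption properties may need to be set explicitly. A sketch, using a placeholder host and leaving the credentials to the template's username/password parameters rather than embedding them in the URL:

```java
public class JdbcUrlBuilder {
    /** Builds a SQL Server JDBC URL with explicit encryption settings. */
    public static String buildUrl(String host, int port, String database,
                                  boolean encrypt, boolean trustServerCert) {
        return "jdbc:sqlserver://" + host + ":" + port
                + ";databaseName=" + database
                + ";encrypt=" + encrypt
                + ";trustServerCertificate=" + trustServerCert + ";";
    }

    public static void main(String[] args) {
        // Placeholder host; trustServerCertificate=true accepts a self-signed cert,
        // which is acceptable for a test migration but not for production.
        System.out.println(buildUrl("203.0.113.10", 1433, "AdventureWorks2019", true, true));
    }
}
```

Setting encrypt=false is the other option for a quick local test, but keeping encryption on with trustServerCertificate=true is the safer compromise.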