
Facing issues on Dataproc Serverless while connecting to databases

I can connect to and fetch data from SQL Server and Postgres databases using a JDBC connection on the Dataproc cluster service. But when I try to do the same on Dataproc Serverless with the same code and JAR files, I get the errors below.

POSTGRES

org.postgresql.util.PSQLException: The connection attempt failed. Caused by: java.net.SocketTimeoutException: Connect timed out

I am using this command "gcloud dataproc batches submit --project <project-id> --region <region-id> pyspark --batch postgrestest gs://<bucket-name>/code/filename.py --version 2.0 --jars gs://bucket-name/jars/postgresql_42_7_1.jar --subnet default --service-account <service-account> "

 

SQLSERVER

com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host x.x.x.x, port 1433 has failed. Error: "Connect timed out. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall."

I am using this command "gcloud dataproc batches submit --project <project-id> --region <region-id> pyspark --batch sqlservertest gs://<bucket-name>/code/filename.py --version 2.0 --jars gs://bucket-name/jars/mssql_jdbc_8_4_1_jre8.jar --subnet default --service-account <service-account> "

 

I am also facing the same issue with MongoDB.

MongoDB

java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: com.mongodb.spark.sql.DefaultSource Unable to get public no-arg constructor

Caused by: java.lang.NoClassDefFoundError: scala/Serializable

Command: "gcloud dataproc batches submit --project <project-id> --region <region> pyspark --batch mongodbtest gs://<bucket-name>/code/filename.py --version 2.0 --jars gs://bucket-name/jars/mongo-spark-connector_2.12.jar --properties spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 --subnet default --service-account <service-account>"

I am not able to figure out how to resolve these issues on Dataproc Serverless.

 

@ms4446 would you be able to suggest any solution for the above issues?


Looking at your logs, there are two distinct problems:

PostgreSQL and SQL Server: "Connect timed out"

This error typically indicates a network configuration issue, such as restrictive firewall rules, unreachable database servers, or servers not listening on the specified ports. One difference that matters here: Dataproc Serverless workloads run with internal IP addresses only, inside the subnet you pass via --subnet, so a route that works from a Dataproc cluster (for example, one whose nodes have external IPs) can time out from a batch unless Cloud NAT or private connectivity to the database is in place.

MongoDB: "NoClassDefFoundError: scala/Serializable"

This error points to a Scala version mismatch. Dataproc Serverless runtime 2.0 ships Spark built against Scala 2.13, while mongo-spark-connector_2.12:3.0.1 is compiled for Scala 2.12; scala.Serializable no longer exists as a class file in Scala 2.13, which is exactly what the NoClassDefFoundError reports.

Troubleshooting Strategy

Begin with a thorough inspection of the connection strings in your code. Ensure the accuracy of database hostnames/IP addresses, ports, and database names. Crucially, include any necessary connection parameters like SSL settings or authentication modes required by your databases. This step often resolves issues without the need for more complex troubleshooting.
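For reference, here is a minimal sketch of the kind of JDBC read involved, with the connection options made explicit. The hostname, credentials, and table are placeholders (not values from your post), and the exact properties honored vary by driver:

    # Minimal sketch of a PySpark JDBC read; <db-host>, <database>,
    # <schema>.<table>, <user>, and <password> are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("postgrestest").getOrCreate()

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://<db-host>:5432/<database>?sslmode=require")
        .option("driver", "org.postgresql.Driver")
        .option("dbtable", "<schema>.<table>")
        .option("user", "<user>")
        .option("password", "<password>")
        # Fail fast rather than hanging; the PostgreSQL driver's
        # connectTimeout is in seconds.
        .option("connectTimeout", "10")
        .load()
    )
    df.show(5)

An explicit connect timeout will not fix a network problem, but it turns a long hang into a quick, loggable failure.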

Network configuration is a common culprit for connectivity issues:

  • Firewall Rules: Verify firewall settings at your database instances, within your Google Cloud VPC, and any intermediary security layers. Ensure these rules allow traffic on the default ports for PostgreSQL (5432), SQL Server (1433), and MongoDB (27017), and check for any service-specific network policies that could impact connectivity.

  • Database Accessibility: Ensure that your database servers are operational, accessible on the network, and configured to accept connections; a minimal probe you can run from the batch's own subnet is sketched after this list.

  • VPC Settings: Review your VPC subnet and routing. Confirm that Dataproc Serverless and the databases can communicate directly, or ensure proper routes/NAT are established for cross-network access.

  • VPC Peering: For databases located in separate VPCs, set up VPC peering for seamless communication.
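Because the timeout only appears on Serverless, test reachability from the batch's own network path rather than from your workstation. A small probe script, submitted as its own batch with the same --subnet and --service-account flags you already use, can confirm whether the database port is reachable from that subnet. A minimal sketch, where the host and port are placeholders:

    # Hedged sketch: TCP reachability probe, meant to be submitted as a
    # Dataproc Serverless batch so it runs inside the same subnet as the job.
    import socket

    HOST = "<db-host>"   # placeholder: your database host or IP
    PORT = 5432          # placeholder: 1433 for SQL Server, 27017 for MongoDB

    try:
        with socket.create_connection((HOST, PORT), timeout=10):
            print(f"OK: {HOST}:{PORT} is reachable from this subnet")
    except OSError as err:
        print(f"FAILED: cannot reach {HOST}:{PORT}: {err}")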

Resolve Dependency Conflicts (MongoDB)

Address Scala compatibility between your Mongo Spark Connector and the Dataproc Serverless environment. Since runtime 2.0 uses Scala 2.13, the likely fix is switching to a Scala 2.13 build of the connector (the 10.x series publishes mongo-spark-connector_2.13 artifacts) instead of the _2.12 artifact you are currently loading. The Mongo Spark Connector's official documentation is a valuable resource for compatible versions and troubleshooting tips.
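Note that the 10.x series also changed the data source API, so the read code changes along with the artifact. A minimal sketch, assuming a mongo-spark-connector_2.13 10.x build; the version, URI, database, and collection below are assumptions to verify against Maven Central and your environment:

    # Hedged sketch for the 10.x Mongo Spark Connector (Scala 2.13 build).
    # Prefer supplying the package at submit time, e.g.
    #   --properties spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.13:10.2.1
    # The version here is an assumption; check Maven Central for current releases.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mongodbtest").getOrCreate()

    df = (
        spark.read.format("mongodb")  # the 10.x connector registers as "mongodb"
        .option("connection.uri", "mongodb://<mongodb-host>:27017")
        .option("database", "<db>")
        .option("collection", "<collection>")
        .load()
    )
    df.printSchema()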

Additional Techniques

  • Connection Verification: Use tools like telnet or nc (netcat) for basic network connectivity tests. Commands like telnet hostname port or nc -zv hostname port can quickly determine if a port is open and reachable. Note that these only exercise the network path your Dataproc Serverless jobs take if you run them from inside the same VPC subnet, for example from a VM in that subnet or via the probe batch sketched earlier.

  • Enhanced Logging: Augment your PySpark code with extensive logging. Adjust Spark logging options or system properties to increase verbosity, particularly for JDBC drivers, for example by passing spark.executor.extraJavaOptions=-Dorg.slf4j.simpleLogger.defaultLogLevel=debug via the --properties flag of gcloud dataproc batches submit (the --conf flag belongs to spark-submit). A driver-side starting point is sketched after this list.

  • JDBC Driver Tips: For JDBC-dependent use cases, delve into driver-specific connection properties (e.g., sslmode, currentSchema). Consult your JDBC drivers' documentation for detailed guidance and best practices.
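As a starting point for the logging suggestion above, you can also raise verbosity from inside the job itself; a minimal sketch:

    # Hedged sketch: raise driver-side log verbosity so JDBC driver activity
    # shows up in the batch output (viewable in Cloud Logging).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-debug").getOrCreate()
    spark.sparkContext.setLogLevel("DEBUG")

    # ... perform the JDBC read here; connection attempts and failures will
    # now be logged in much more detail.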

Proactive Prevention for Consistent Connectivity

  • Infrastructure as Code (IaC): Leverage IaC tools like Terraform to manage and version-control your databases, subnets, firewall rules, and network components. This approach ensures configuration consistency and simplifies rollback processes.

  • Automated Connectivity Tests: Integrate basic connectivity tests into your CI/CD pipeline for early detection of misconfigurations. Tools like curl, telnet, or custom scripts (one is sketched below) can automate these tests so misconfigurations are caught before they reach production.
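Such a custom script can be as small as the probe shown earlier, generalized over all endpoints and wired to a non-zero exit code so the pipeline fails on a regression; the hostnames below are placeholders:

    # Hedged sketch: CI connectivity check using only the standard library.
    import socket
    import sys

    ENDPOINTS = [
        ("<postgres-host>", 5432),
        ("<sqlserver-host>", 1433),
        ("<mongodb-host>", 27017),
    ]

    failures = []
    for host, port in ENDPOINTS:
        try:
            with socket.create_connection((host, port), timeout=5):
                pass
        except OSError as err:
            failures.append(f"{host}:{port} -> {err}")

    if failures:
        print("Connectivity check failed:\n" + "\n".join(failures))
        sys.exit(1)  # fail the CI job
    print("All endpoints reachable")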

Thank you for responding to my post. As mentioned, I can establish connections and fetch data from the databases on the Dataproc cluster service, but I get these errors with the same code on Dataproc batches (serverless).

I have tried the network configuration changes and additional techniques you mentioned, but I am still not able to resolve the issues.

I had a similar issue: my workflow templates suddenly stopped working due to connection timeouts with Maven.
It seems an update happened and changed the networking behavior; any internet connection from the cluster fails.

I created a NAT gateway to fix this connection issue.

Did you get this issue for the Dataproc batches (serverless) or something else?

Can you please elaborate a little more on the issue and fix you applied?

Did you manage to solve this issue? I also use Postgres and face the same problem. It works fine for smaller tables with the exact same options and parameters, so it's definitely not a firewall issue.