
Facing an error when connecting to Cloud SQL MSSQL Server using JDBC in Dataproc

When I run the same PySpark job to read the Cloud SQL data using JDBC in PyCharm on my local machine, it is able to fetch the data. However, when I try to run the same job in Dataproc, I am facing this error:

 

: com.microsoft.sqlserver.jdbc.SQLServerException: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. Error: "SQL Server returned an incomplete response. The connection has been closed. ClientConnectionId:431e9cba-6f40-4239-b537-5822e2d7b127".

 

We have tried using higher versions of the JAR files but are still facing the issue.

We are using Cloud SQL SQL Server 2022 Standard with the lowest configuration: 1 vCPU, 3.75 GB RAM, 20 GB SSD. We have opened the authorized networks to all (0.0.0.0/0).

We are using the Dataproc cluster config given below:

Master node: Single Node (1 master, 0 workers)
Machine type: n1-standard-4
Number of GPUs: 0
Primary disk type: pd-standard
Primary disk size: 100 GB

Based on the error message and details provided, it appears the issue lies in establishing a secure SSL connection between Dataproc and Cloud SQL. To address this, consider the following steps:

  1. Verify Driver and Connectivity:

    • Ensure you're using the correct MSSQL JDBC driver version, compatible with both the Java version on your Dataproc cluster and your SQL Server version. The latest driver version is often recommended.
    • Confirm the driver JAR file's accessibility to the Dataproc cluster, either by placing it in the same directory as your PySpark script or in an accessible Cloud Storage bucket.
    • Recheck Cloud SQL's authorized networks settings. While opening to 0.0.0.0/0 is broad, for security, it's better to use a more specific IP range, such as the Dataproc cluster's internal IP range.
    • Investigate any potential firewall rules or network configurations that might be impeding the connection.
  2. Validate SSL Configuration:

    • The error suggests an SSL encryption issue. Ensure your Cloud SQL instance is configured for SSL, and consider setting "Require SSL" in the instance settings.
    • Configure your Dataproc cluster to trust the Cloud SQL instance's SSL certificate by adding it to the cluster's truststore.
    • Review your PySpark script for SSL-related parameters like encrypt=true or trustServerCertificate=true, and adjust them as needed (see the sketch after this list).
  3. Additional Tips:

    • Examine Dataproc cluster logs for more detailed error information, focusing on JDBC connection or SSL handshake issues.
    • Use tools like nc to test connectivity and the SSL handshake between Dataproc and Cloud SQL (see the example commands after this list).
    • Test your PySpark script against a local SQL Server instance to validate connection parameters and isolate the issue.
  4. Dataproc Cluster Configuration:

    • While your current Dataproc configuration seems adequate for testing, consider scaling resources for larger jobs or production workloads. Adjustments might include more worker nodes or a more powerful machine type.
  5. Logging and Monitoring:

    • Enable detailed logging for the JDBC driver and Dataproc jobs to gain deeper insights.
    • Utilize Google Cloud's monitoring and logging tools for performance tracking.
  6. Security and Best Practices:

    • Avoid using 0.0.0.0/0 in production environments due to security risks. Opt for restricted IP ranges.
    • Follow best practices for secure JDBC connections and sensitive data management, such as encrypted connections and secure credential handling.
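
As a reference for the SSL parameters in item 2, here is a minimal sketch of what the JDBC read could look like in the PySpark script. The host, database, table, and credential values are all placeholders, not values from this thread:

    # Minimal PySpark JDBC read against Cloud SQL for SQL Server (sketch).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cloudsql-mssql-test").getOrCreate()

    df = (
        spark.read.format("jdbc")
        # encrypt/trustServerCertificate control the driver's TLS behavior.
        .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<dbname>;"
                       "encrypt=true;trustServerCertificate=true")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .option("dbtable", "dbo.some_table")  # hypothetical table name
        .option("user", "<user>")
        .option("password", "<password>")
        .load()
    )
    df.show(5)

When submitting to Dataproc, the driver JAR also needs to be on the job's classpath, for example via the --jars flag of gcloud dataproc jobs submit pyspark.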
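
For the connectivity test in item 3, here is an example command that could be run on the Dataproc master node over SSH; the IP is a placeholder:

    # Basic TCP reachability check to the Cloud SQL instance:
    nc -zv <cloud-sql-ip> 1433

One caveat: SQL Server negotiates TLS inside its TDS pre-login exchange rather than at the start of the TCP connection, so a plain openssl s_client against port 1433 will not complete a handshake even when everything is healthy; nc and curl only confirm that the TCP path is open, not that the SSL handshake can succeed.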

Thanks for replying to my post. I have tested most of these things after seeing them in your replies to previous posts about similar issues from others on the community.

I have retried, and below I am writing my output for the point-wise suggestions you gave. The issue is not getting resolved. Not sure if this is possible, but could I request a call with you or someone from your team where I can share my screen and try to fix this issue?

  1. Verify Driver and Connectivity:

    • I have now tried using the latest version, mssql-jdbc-12.4.2.jre11.jar, and am getting the error: "com.microsoft.sqlserver.jdbc.SQLServerException: "encrypt" property is set to "true" and "trustServerCertificate" property is set to "true" but the driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption: Error: SQL Server returned an incomplete response. The connection has been closed. ClientConnectionId:62ac13c1-494d-438b-8ea6-d0fe88fe2e2c."
    • I have also added encrypt and trustServerCertificate in the PySpark script; the "url" option is set to jdbc:sqlserver://hostname;database=dbname;encrypt=true;trustServerCertificate=true;
    • The JAR is placed in a Cloud Storage bucket and it is accessible.
    • I had tried with a specific IP range, but the error kept recurring, so I am now trying with 0.0.0.0/0 (knowing this is not best practice) just to test the connection; when we allow all, this should also work.
    • I am trying this in the default network, and there is no firewall rule at the network or org level denying the traffic; TCP port 1433 is open for both ingress and egress.
  2. Validate SSL Configuration:

    • I updated the Cloud SQL instance and enabled "Allow only SSL connections".
    • I am not sure how to configure the Dataproc cluster to trust the Cloud SQL instance's SSL certificate by adding it to the cluster's truststore, so I could not try this.
    • SSL-related parameters like encrypt=true and trustServerCertificate=true are set as shown above in the URL.
  3. Additional Tips:

    • The Dataproc cluster logs show the above error message along with the following stack trace:
      java.io.IOException: SQL Server returned an incomplete response. The connection has been closed. ClientConnectionId:259a6c1c-a575-4c88-b018-e267e81fc7d9
      at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.readInternal(IOBuffer.java:948)
      at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.read(IOBuffer.java:935)
      at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.readInternal(IOBuffer.java:1197)
      at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.read(IOBuffer.java:1183)
      at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
      at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
      at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
      at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
      at org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
      at com.microsoft.sqlserver.jdbc.TDSChannel.enableSSL(IOBuffer.java:1843)
    • Use tools like nc to test connectivity and SSL handshake between Dataproc and Cloud SQL. 
    • nc was not giving any output, so I tried this:
      amit_shrivastava@cluster-77ec-m:~$ curl -vv telnet://hostname:1433
      * Trying hostname:1433...
      * Connected to hostname (hostname) port 1433 (#0)
    • Test your PySpark script against a local SQL Server instance to validate connection parameters and isolate the issue.
    • The PySpark script works in local PyCharm and returns the output result. There I am using mssql-jdbc-8.4.1.jre8.jar and it works fine, but on Dataproc it works with neither this JAR nor the latest JAR, mssql-jdbc-12.4.2.jre11.jar.

Based on your latest findings, here are some additional suggestions:

  1. JDBC Driver and Connection String:

    • Since the error persists with the latest JDBC driver, consider testing with an intermediate version of the driver, if you haven't already. Sometimes, the very latest version might introduce new issues.
    • Double-check the connection string parameters. Since it works locally with a different driver version, try matching those parameters as closely as possible in your Dataproc environment.
  2. SSL Certificate and Truststore Configuration:

    • Configuring the Dataproc cluster to trust the Cloud SQL instance's SSL certificate is a crucial step. You can do this by downloading the SSL certificate from your Cloud SQL instance and then using an initialization action in Dataproc to add this certificate to the Java keystore on your cluster nodes.
    • Here's a basic outline of how to do this:
      • Download the Cloud SQL instance's SSL certificate.
      • Create a Dataproc initialization action script that adds this certificate to the Java keystore. This script will run on each node when the cluster is created.
      • Use the gcloud dataproc clusters create command with the --initialization-actions flag to specify your script when creating your Dataproc cluster (a sketch follows after this list).
  3. Network Connectivity:

    • Since nc didn't provide output, it's worth ensuring there is no network issue between Dataproc and Cloud SQL. The fact that curl could connect suggests the basic network path is open, but it doesn't confirm whether the SSL handshake can complete successfully.
    • Consider using a more advanced network diagnostic tool or method to specifically test SSL connectivity and handshake.
  4. Driver Compatibility with Dataproc Environment:

    • There might be compatibility issues between the JDBC driver and the specific Java or PySpark environment on Dataproc. Ensure that the Java version on Dataproc matches the requirements of the JDBC driver.
    • Since the script works locally with a different JDBC driver version, try using that same version in Dataproc, ensuring all other environment variables and configurations are consistent.
  5. Reviewing Dataproc Cluster Logs:

    • Dive deeper into the Dataproc cluster logs. Sometimes the logs contain more detailed information about the SSL handshake process and where exactly it's failing (note the Conscrypt frames in the trace you posted; see the observation after this list).
  6. Testing with a Simplified Setup:

    • As a diagnostic step, try setting up a new, minimal Dataproc cluster and a simple JDBC connection test script. This can help isolate the problem from the other complexities of your current setup (see the sketch after this list).
  7. Consulting Google Cloud Support:

    • Given the complexity and persistence of the issue, it might be beneficial to reach out to Google Cloud Support. They can provide more targeted assistance, especially since they can access detailed logs and configurations that are not publicly available.
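
As a sketch of the truststore approach in item 2, an initialization action along these lines could import the Cloud SQL server CA certificate into the JVM truststore on each node. The bucket path, script name, certificate file, and cacerts location are assumptions, not values from this thread, and the cacerts path varies by Dataproc image and JRE:

    #!/bin/bash
    # Hypothetical init action: import the Cloud SQL server CA into the
    # default Java truststore on every cluster node.
    set -euo pipefail

    # Assumes the server CA was downloaded from the Cloud SQL instance's
    # Connections page and uploaded to this bucket beforehand.
    gsutil cp gs://<bucket>/server-ca.pem /tmp/server-ca.pem

    # Adjust the -keystore path to match the JRE on your Dataproc image.
    keytool -importcert -trustcacerts -noprompt \
      -alias cloudsql-server-ca \
      -file /tmp/server-ca.pem \
      -keystore /usr/lib/jvm/default-java/lib/security/cacerts \
      -storepass changeit

The cluster would then be created with the script attached:

    gcloud dataproc clusters create <cluster-name> \
      --region=<region> \
      --single-node \
      --initialization-actions=gs://<bucket>/import-sql-ca.sh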
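One observation for item 5: the stack trace posted above shows the handshake passing through org.conscrypt.ConscryptEngineSocket. Dataproc images enable the Conscrypt security provider by default, and it can be turned off at cluster creation time via a cluster property. This is a hedged suggestion based only on those frames, not a confirmed fix for this thread:

    gcloud dataproc clusters create <cluster-name> \
      --region=<region> \
      --properties=dataproc:dataproc.conscrypt.provider.enable=false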
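For the simplified setup in item 6, a minimal smoke test could be submitted as below, with the bucket, cluster, and region as placeholders and jdbc_smoke_test.py being a stripped-down JDBC read such as the sketch shown after the first reply:

    gcloud dataproc jobs submit pyspark gs://<bucket>/jdbc_smoke_test.py \
      --cluster=<cluster-name> \
      --region=<region> \
      --jars=gs://<bucket>/mssql-jdbc-12.4.2.jre11.jar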