
GCP Spark Serverless to Snowflake connection issue

I'm trying to connect to Snowflake from a Google Cloud Dataproc Serverless (Batch) Spark job (Spark 3.1 on Scala 2.12, Snowflake JDBC 3.13.30, Spark Snowflake connector 2.11.3), but I get connectivity issues:

23/05/03 18:49:45 ERROR RestRequest: Stop retrying since elapsed time due to network issues has reached timeout. Elapsed: 120,138(ms), timeout: 60,000(ms)

Exception in thread "main" net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error. Message: Exception encountered for HTTP request: Connect to XXX.snowflakecomputing.com:443 [XXX.snowflakecomputing.com/34.107.221.154] failed: connect timed out.

I tried running the same job on the default Dataproc offering (the one where you create the clusters yourself) and the problem doesn't occur there. Are there additional firewall rules that prevent Spark Serverless from reaching the outside world?
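For context, the job reads from Snowflake with a standard spark-snowflake connector call, roughly like this (all option values below are placeholders for the real configuration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SnowflakeBatch").getOrCreate()

// Placeholder connection options; real values come from the job config.
val sfOptions = Map(
  "sfURL"       -> "XXX.snowflakecomputing.com",
  "sfUser"      -> "USER",
  "sfPassword"  -> "PASSWORD",
  "sfDatabase"  -> "DB",
  "sfSchema"    -> "PUBLIC",
  "sfWarehouse" -> "WH"
)

val df = spark.read
  .format("net.snowflake.spark.snowflake") // spark-snowflake source
  .options(sfOptions)
  .option("dbtable", "MY_TABLE")
  .load()

df.show()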


The error message you received indicates a connectivity issue between your Spark job running on Google Cloud Dataproc Serverless and Snowflake. Here are a few things you can check and try to resolve the issue:

  1. Firewall rules: Make sure there are no firewall rules preventing the Spark job from reaching the Snowflake service. Since the problem does not occur with the default Dataproc setup, it is possible that the serverless environment is subject to additional network restrictions. Check your networking configuration and consult your network administrator to ensure the necessary egress connectivity is allowed.

  2. Network troubleshooting with SnowCD: Snowflake provides a network diagnostic tool called SnowCD. You can use it to evaluate and troubleshoot your network connection to Snowflake, both during initial configuration and for on-demand troubleshooting. Refer to Snowflake's documentation on how to use SnowCD to verify your network connection.

Please refer to the following link: https://docs.snowflake.com/developer-guide/jdbc/jdbc-configure

  3. Proxy server settings: If you are using a proxy server to connect to Snowflake, make sure the proxy settings are correctly configured. There are two ways to specify a proxy server with the Snowflake JDBC driver: by setting Java system properties, or by including the proxy host and port information in the JDBC connection string or the Properties object passed to the DriverManager.getConnection() method. Both techniques are shown below.

If you choose to set the proxy system properties in your code, you can use the System.setProperty method to set the necessary properties. Here's an example:

// Route the driver's HTTP/HTTPS traffic through a proxy.
// The "... Value" strings are placeholders for your actual proxy host/port.
System.setProperty("http.useProxy", "true");
System.setProperty("http.proxyHost", "proxyHost Value");
System.setProperty("http.proxyPort", "proxyPort Value");
System.setProperty("https.proxyHost", "proxyHost HTTPS Value");
System.setProperty("https.proxyPort", "proxyPort HTTPS Value");
System.setProperty("http.proxyProtocol", "https");
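Alternatively (the second technique mentioned above), the proxy details can be passed as Snowflake JDBC connection parameters via the Properties object. A minimal sketch with placeholder credentials and proxy values:

import java.sql.DriverManager
import java.util.Properties

// Placeholder values; useProxy, proxyHost and proxyPort are Snowflake JDBC
// connection parameters (see the documentation linked above).
val props = new Properties()
props.put("user", "USER")
props.put("password", "PASSWORD")
props.put("useProxy", "true")
props.put("proxyHost", "proxyHost Value")
props.put("proxyPort", "proxyPort Value")

val conn = DriverManager.getConnection(
  "jdbc:snowflake://XXX.snowflakecomputing.com/", props)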

Hi, this sounds like a very generic solution:

1. Can you create firewall rules in GCP that target Spark Serverless directly while omitting Dataproc? They're running in the same network.

2. Can you use SnowCD in a Spark Serverless instance?

3. We don't use a proxy.

For a list of supported Dataproc Serverless Spark runtime releases, see the following link:

https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions 
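For example, the runtime version is pinned at submission time with the --version flag (the value must be one of the supported releases from that list; the region, bucket, class, and jar names below are placeholders):

gcloud dataproc batches submit spark \
  --region=us-central1 \
  --version=1.1 \
  --class=com.example.SnowflakeJob \
  --jars=gs://my-bucket/my-job.jar,gs://my-bucket/snowflake-jdbc-3.13.30.jar,gs://my-bucket/spark-snowflake_2.12-2.11.3-spark_3.1.jar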

I know this page very well, but it doesn't answer any of my follow-up questions.

1. Firewall rules are preconfigured to restrict access to the Spark web UIs within the same VM; they are not user-configurable.

2. SnowCD is not currently available with Serverless Spark.
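The closest I can get to a connectivity check there is a crude TCP probe run from inside the batch itself; a minimal sketch (the host is the account URL from the error above):

import java.net.{InetSocketAddress, Socket}

// Probe TCP reachability of the Snowflake endpoint on port 443.
val host = "XXX.snowflakecomputing.com"
val socket = new Socket()
try {
  socket.connect(new InetSocketAddress(host, 443), 10000) // 10 s timeout
  println(s"TCP connect to $host:443 succeeded")
} catch {
  case e: java.io.IOException =>
    println(s"TCP connect to $host:443 failed: ${e.getMessage}")
} finally {
  socket.close()
}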


I was having a similar issue connecting from Dataproc Serverless to an on-prem storage appliance. Setting egress firewall rules did not work, and neither did using a custom container. It is very frustrating that this limitation does not seem to be documented anywhere. I finally solved it by creating a Cloud NAT (needed because Dataproc Serverless nodes do not have external IPs), in addition to making sure that proper egress firewall rules are in place; see the sketch below.
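For reference, a rough sketch of that setup with gcloud (network, region, and resource names are placeholders):

# Egress firewall rule allowing HTTPS out of the network.
gcloud compute firewall-rules create allow-egress-443 \
  --network=default \
  --direction=EGRESS \
  --action=ALLOW \
  --rules=tcp:443 \
  --destination-ranges=0.0.0.0/0

# Cloud Router in the same region/network as the Serverless batch.
gcloud compute routers create nat-router \
  --network=default \
  --region=us-central1

# Cloud NAT gateway so nodes without external IPs can reach out.
gcloud compute routers nats create nat-config \
  --router=nat-router \
  --region=us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges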

I understand the frustration; this limitation is not clearly documented. Creating a Cloud NAT is the right fix here: Cloud NAT allows instances without external IP addresses, such as Dataproc Serverless nodes, to initiate connections to resources outside the Google Cloud environment. With Cloud NAT configured and the appropriate egress firewall rules in place, Dataproc Serverless can reach your on-prem storage appliance.