I'm trying to connect to Snowflake from a Google Cloud Dataproc Serverless (Batch) Spark job (Spark 3.1 on Scala 2.12, Snowflake JDBC 3.13.30, Spark Snowflake connector 2.11.3), but I get connectivity issues:
23/05/03 18:49:45 ERROR RestRequest: Stop retrying since elapsed time due to network issues has reached timeout. Elapsed: 120,138(ms), timeout: 60,000(ms)
Exception in thread "main" net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error. Message: Exception encountered for HTTP request: Connect to XXX.snowflakecomputing.com:443 [XXX.snowflakecomputing.com/34.107.221.154] failed: connect timed out.
I tried running it using the standard Dataproc solution (the one where you create the clusters yourself), and the problem doesn't occur there. Are there additional firewall rules which prevent Spark Serverless from reaching the outside world?
The error message you received indicates a connectivity issue between your Spark job running on Google Cloud Dataproc Serverless and Snowflake. Here are a few things you can check and try to resolve the issue:
1. Firewall rules: Make sure there are no additional firewall rules that prevent the Spark job from reaching the Snowflake service. You mentioned that the problem does not occur when using the standard Dataproc solution, so it's possible that there are additional firewall restrictions in the serverless environment. Check your networking configuration and consult your network administrator to ensure the necessary connectivity is allowed.
2. Network troubleshooting with SnowCD: Snowflake provides a network diagnostic tool called SnowCD. You can use SnowCD to evaluate and troubleshoot your network connection to Snowflake, both during the initial configuration process and for on-demand troubleshooting. Refer to Snowflake's documentation on how to use SnowCD to verify your network connection. See also the JDBC driver configuration guide: https://docs.snowflake.com/developer-guide/jdbc/jdbc-configure
3. Proxy server settings: If you are using a proxy server to connect to Snowflake, make sure the proxy settings are correctly configured. There are two ways to specify a proxy server with the Snowflake JDBC driver: by setting Java system properties, or by including the proxy host and port information in the JDBC connection string or the Properties object passed to the DriverManager.getConnection() method. Both techniques are shown below.
If you choose to set the proxy system properties in your code, you can use the System.setProperty method to set the necessary properties. Here's an example:
System.setProperty("http.useProxy", "true");
System.setProperty("http.proxyHost", "proxyHost Value");
System.setProperty("http.proxyPort", "proxyPort Value");
System.setProperty("https.proxyHost", "proxyHost HTTPS Value");
System.setProperty("https.proxyPort", "proxyPort HTTPS Value");
System.setProperty("http.proxyProtocol", "https");
Hi, this sounds like a very generic solution:
1. Can you create firewall rules in GCP which aim directly at Spark Serverless while omitting Dataproc? They're running in the same network.
2. Can you use SnowCD in a Spark Serverless instance?
3. We don't use a proxy.
For a list of supported Dataproc Serverless for Spark runtime releases, see the following link:
https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions
I know this page very well, but it doesn't apply to any of my follow-up questions.
1. Firewall rules are configured to restrict access to Spark web UIs within the same VM. They are not configurable.
2. SnowCD is not currently available with Serverless Spark.
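A rough substitute for SnowCD would be a raw TCP probe run from inside the Spark job itself, which at least shows whether the batch has any egress to the Snowflake endpoint at all. A minimal Scala sketch (hostname and timeout are placeholders):

import java.net.{InetSocketAddress, Socket}

val socket = new Socket()
try {
  // Replace with your real Snowflake account hostname.
  socket.connect(new InetSocketAddress("XXX.snowflakecomputing.com", 443), 10000)
  println("TCP connect to Snowflake on port 443 succeeded")
} catch {
  case e: Exception => println(s"TCP connect failed: ${e.getMessage}")
} finally {
  socket.close()
}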
I was having a similar issue connecting from Dataproc Serverless to an on-prem storage appliance. Setting egress firewall rules did not work, and neither did using a custom container. It is very frustrating that this limitation does not seem to be documented anywhere. I finally solved it by creating a Cloud NAT (needed because Dataproc Serverless nodes do not have external IPs) in addition to making sure that proper egress firewall rules are in place.
That matches the behavior: Cloud NAT allows instances without external IP addresses, such as Dataproc Serverless nodes, to reach resources outside the Google Cloud environment. That is why creating a Cloud NAT, together with the proper egress firewall rules, is what established the connection between Dataproc Serverless and your on-prem storage appliance.