Hi,
I have a Dataproc cluster, and from the web interface (JupyterLab) inside the cluster I am trying to read a table from a SQL Server database hosted on Cloud SQL into a PySpark DataFrame.
When I run the code to read the table, it fails with a 'Data Source Not Found' error.
Below is the code I am trying to run:
server_name = "jdbc:sqlserver://<servername>"
database_name = "name"
url = server_name + ";" + "databaseName=" + database_name + ";"
table_name = "table"
username = "testserver"
password = "password" # Please specify password here
DF = spark.read.format("com.microsoft.sqlserver.jdbc.spark") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')\
.load()
Below is the error I am getting:
Py4JJavaError: An error occurred while calling o125.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.sqlserver.jdbc.spark.
Caused by: java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.spark.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
Can you guide me on what might be going wrong here?
PS: I am not submitting a job to the Dataproc cluster; I am trying to read the SQL Server table from a Jupyter notebook inside the cluster.
Thanks
Hi @logan_branson ,
The "Data Source Not Found" error you're encountering means Spark cannot find the com.microsoft.sqlserver.jdbc.spark data source (the Spark connector for SQL Server) on your cluster's classpath. The simplest fix is to use Spark's built-in jdbc format together with the Microsoft SQL Server JDBC driver. Here's how to fix it:
1. JDBC Driver Installation:
Download the appropriate mssql-jdbc.jar driver (compatible with your SQL Server version) from Microsoft's website: https://learn.microsoft.com/en-us/sql/connect/jdbc/microsoft-jdbc-driver-for-sql-server?view=sql-ser.... Then, add it to your Dataproc cluster using:
gcloud dataproc clusters add-jar <cluster-name> --jar=gs://<your-bucket>/mssql-jdbc.jar
2. Correct PySpark Configuration:
Use the jdbc format and ensure your connection details are accurate:
server_name = "jdbc:sqlserver://<your_cloud_sql_instance_ip>"
database_name = "name"
url = server_name + ";" + "databaseName=" + database_name + ";"
table_name = "table"
username = "testserver"
password = "password"
DF = spark.read.format("jdbc") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
.load()
3. Error Handling:
Consider wrapping the read in try-except blocks in your PySpark code to catch and handle potential errors gracefully (a minimal sketch follows after this list).
4. Alternative for New Clusters:
If you're creating a new cluster, you can automate driver installation using initialization actions. See Dataproc documentation: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions for details.
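As a rough illustration of the error-handling suggestion above, here is a minimal sketch. It assumes the spark session from your notebook and the url, table_name, username, and password variables from the snippet in step 2; adjust as needed:
from py4j.protocol import Py4JJavaError
from pyspark.sql.utils import AnalysisException

try:
    # Read the table through Spark's built-in JDBC data source.
    df = (spark.read.format("jdbc")
          .option("url", url)
          .option("dbtable", table_name)
          .option("user", username)
          .option("password", password)
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())
    df.show(5)
except (Py4JJavaError, AnalysisException) as e:
    # Surface driver/classpath or connection problems instead of a raw stack trace.
    print(f"Failed to read from SQL Server: {e}")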
Hi,
Thanks for the guidance, but when I tried running the command to add the jar file, it said the keyword is an invalid choice. Is there an alternative for this?
This is the error I am getting:
ERROR: (gcloud.dataproc.clusters) Invalid choice: 'add-jar'.
Maybe you meant:
gcloud dataproc autoscaling-policies
gcloud dataproc batches
gcloud dataproc clusters
gcloud dataproc jobs
gcloud dataproc node-groups
gcloud dataproc operations
gcloud dataproc workflow-templates
Hi @logan_branson ,
Apologies for the confusion with the previous instruction. It seems there was an error in the command suggested for adding the JDBC driver to your Dataproc cluster. The add-jar command isn't valid in the gcloud CLI, as you've noticed.
Here is a different approach to ensure the SQL Server JDBC driver is accessible within your cluster:
1. Adding the Driver at Cluster Creation:
If you are in a position to create a new cluster, you can put the driver jar on the Spark classpath during setup with a cluster property (spark:spark.jars). Here's how you can do it:
gcloud dataproc clusters create <cluster-name> \
--region <region> \
--properties=spark:spark.jars=gs://<your-bucket>/mssql-jdbc.jar \
--optional-components=JUPYTER \
--enable-component-gateway
Replace <cluster-name>, <region>, and <your-bucket> with your specifics.
2. Manually Uploading the JDBC Driver:
If creating a new cluster isn't feasible, you can manually upload the JDBC driver to your existing cluster. Upload the mssql-jdbc.jar file to the /usr/lib/spark/jars directory on each node of your cluster. You can do this via SSH:
# SSH into each node
gcloud compute ssh <node-name>
# Once SSH'd in, use gsutil to copy the jar from Cloud Storage
sudo gsutil cp gs://<your-bucket>/mssql-jdbc.jar /usr/lib/spark/jars/
Replace <node-name> and <your-bucket> with the appropriate values for your setup.
3. Restart Your Spark Session:
After uploading the jar, restart your Spark session so the newly added driver is recognized. Jars placed in /usr/lib/spark/jars are picked up when a new Spark application starts, so in JupyterLab restarting the notebook kernel is typically all that's needed. A quick way to confirm the driver is visible is sketched below.
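For example, once the kernel has been restarted, an unofficial way to check from the notebook that the JVM can see the driver class is shown below. This relies on PySpark's internal Py4J gateway (spark._jvm), so treat it as a convenience check rather than a supported API:
# Quick classpath check: raises an error if the driver class is not visible to the Spark driver JVM.
spark._jvm.java.lang.Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
print("SQL Server JDBC driver found on the classpath")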
Additional Tip: Ensure your firewall and network settings allow your Dataproc cluster to communicate with your SQL Server instance. Using the Cloud SQL Auth Proxy could also be beneficial for managing these connections securely.
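If you do try the Cloud SQL Auth Proxy route, here is a minimal sketch of what the read could look like, assuming the proxy is running locally and configured to listen on 127.0.0.1:1433; the host, port, database, table, and credentials are illustrative placeholders taken from the earlier snippet, not values from your actual setup:
# Hypothetical read through a locally running Cloud SQL Auth Proxy.
# Assumes the proxy listens on 127.0.0.1:1433 and the mssql-jdbc driver is on the classpath.
proxy_url = "jdbc:sqlserver://127.0.0.1:1433;databaseName=name;"

df = (spark.read.format("jdbc")
      .option("url", proxy_url)
      .option("dbtable", "table")
      .option("user", "testserver")
      .option("password", "password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())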
Once you have the JDBC driver properly configured and in place, adjust your PySpark script as previously mentioned and try running it again.