Hi,
I have a Dataproc cluster, and from the JupyterLab web interface inside the cluster I am trying to read a table from a SQL Server instance hosted on Cloud SQL into a PySpark dataframe.
When I run the query to read the table from SQL Server, it fails with a "Failed to find data source" error.
Below is the code I am trying to run:
server_name = "jdbc:sqlserver://<servername>"
database_name = "name"
url = server_name + ";" + "databaseName=" + database_name + ";"
table_name = "table"
username = "testserver"
password = "password"  # placeholder; the real password is set here
DF = spark.read.format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
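For context, here is a minimal sketch of the same read using Spark's built-in "jdbc" source, which as far as I understand needs only the mssql-jdbc driver JAR on the classpath (not the separate spark-mssql-connector). The helper function and connection values below are placeholders, not my real setup:

```python
# Placeholder connection details, same shape as above.
server_name = "jdbc:sqlserver://<servername>"
database_name = "name"
url = server_name + ";" + "databaseName=" + database_name + ";"

def jdbc_read_options(url, table, user, password):
    """Build the option dict for Spark's built-in 'jdbc' data source.

    This source ships with Spark itself, so no extra connector JAR is
    needed; only the SQL Server JDBC driver must be on the classpath.
    """
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

opts = jdbc_read_options(url, "table", "testserver", "password")
# In the notebook this would become:
# df = spark.read.format("jdbc").options(**opts).load()
```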
Below is the error I am getting:
Py4JJavaError: An error occurred while calling o125.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.sqlserver.jdbc.spark.
Caused by: java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.spark.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
Can you guide me on what might be going wrong here?
PS: I am not submitting a Dataproc job; I am reading the SQL Server table from a Jupyter notebook inside the cluster.
Thanks