Hi,
I have a Dataproc cluster, and from the web interface (JupyterLab) inside the cluster I am trying to read a table from a SQL Server database hosted on Cloud SQL into a PySpark DataFrame.
When I run the code to read the table, it fails with a 'Data Source Not Found' error.
Below is the code I am trying to run:
server_name = "jdbc:sqlserver://<servername>"
database_name = "name"
url = server_name + ";" + "databaseName=" + database_name + ";"
table_name = "table"
username = "testserver"
password = "password" # Please specify password here
DF = spark.read.format("com.microsoft.sqlserver.jdbc.spark") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')\
.load()
Below is the error I am getting:
Py4JJavaError: An error occurred while calling o125.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.sqlserver.jdbc.spark.
Caused by: java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.spark.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
Can you guide me on what might be going wrong here?
PS: I am not submitting a job to the Dataproc cluster; I am trying to read the SQL Server table from a Jupyter notebook inside the cluster.
Thanks
Hi @logan_branson ,
The "Data Source Not Found" error you're encountering means Spark cannot find the com.microsoft.sqlserver.jdbc.spark data source (the Spark connector for SQL Server) on your cluster's classpath. The simplest fix is to use Spark's built-in jdbc format together with the Microsoft SQL Server JDBC driver. Here's how to fix it:
1. JDBC Driver Installation:
Download the appropriate mssql-jdbc.jar driver (compatible with your SQL Server version) from Microsoft's website: https://learn.microsoft.com/en-us/sql/connect/jdbc/microsoft-jdbc-driver-for-sql-server?view=sql-ser.... Then, add it to your Dataproc cluster using:
gcloud dataproc clusters add-jar <cluster-name> --jar=gs://<your-bucket>/mssql-jdbc.jar
2. Correct PySpark Configuration:
Use the jdbc format and ensure your connection details are accurate:
server_name = "jdbc:sqlserver://<your_cloud_sql_instance_ip>"
database_name = "name"
url = server_name + ";" + "databaseName=" + database_name + ";"
table_name = "table"
username = "testserver"
password = "password"
DF = spark.read.format("jdbc") \
.option("url", url) \
.option("dbtable", table_name) \
.option("user", username) \
.option("password", password) \
.option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
.load()
3. Error Handling:
Consider wrapping the read in try-except blocks in your PySpark code to catch and handle potential errors gracefully (a minimal sketch follows after this list).
4. Alternative for New Clusters:
If you're creating a new cluster, you can automate driver installation using initialization actions. See Dataproc documentation: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions for details.
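As a rough illustration of the error-handling suggestion above, here is a minimal sketch. It assumes the spark session from your notebook and the url, table_name, username, and password variables from the snippet in step 2; adjust as needed:
from py4j.protocol import Py4JJavaError
from pyspark.sql.utils import AnalysisException

try:
    # Read the table through Spark's built-in JDBC data source.
    df = (spark.read.format("jdbc")
          .option("url", url)
          .option("dbtable", table_name)
          .option("user", username)
          .option("password", password)
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())
    df.show(5)
except (Py4JJavaError, AnalysisException) as e:
    # Surface driver/classpath or connection problems instead of a raw stack trace.
    print(f"Failed to read from SQL Server: {e}")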
Hi,
Thanks for the guidance, but when I tried running the command to add the jar file, it said the keyword is an invalid choice. Is there an alternative for this?
This is the error I am getting:
ERROR: (gcloud.dataproc.clusters) Invalid choice: 'add-jar'.
Maybe you meant:
gcloud dataproc autoscaling-policies
gcloud dataproc batches
gcloud dataproc clusters
gcloud dataproc jobs
gcloud dataproc node-groups
gcloud dataproc operations
gcloud dataproc workflow-templates
Hi @logan_branson ,
Apologies for the confusion with the previous instruction. It seems there was an error in the command suggested for adding the JDBC driver to your Dataproc cluster. The add-jar command isn't valid in the gcloud CLI, as you've noticed.
Here is a different approach to ensure the SQL Server JDBC driver is accessible within your cluster:
1. Adding the Driver at Cluster Creation:
If you are in a position to create a new cluster, you can put the driver jar on the Spark classpath during setup with a cluster property (spark:spark.jars). Here's how you can do it:
gcloud dataproc clusters create <cluster-name> \
--region <region> \
--properties=spark:spark.jars=gs://<your-bucket>/mssql-jdbc.jar \
--optional-components=JUPYTER \
--enable-component-gateway
Replace <cluster-name>, <region>, and <your-bucket> with your specifics.
2. Manually Uploading the JDBC Driver:
If creating a new cluster isn't feasible, you can manually upload the JDBC driver to your existing cluster. Upload the mssql-jdbc.jar file to the /usr/lib/spark/jars directory on each node of your cluster. You can do this via SSH:
# SSH into each node
gcloud compute ssh <node-name>
# Once SSH'd in, use gsutil to copy the jar from Cloud Storage
sudo gsutil cp gs://<your-bucket>/mssql-jdbc.jar /usr/lib/spark/jars/
Replace <node-name> and <your-bucket> with the appropriate values for your setup.
3. Restart Your Spark Session:
After uploading the jar, restart your Spark session so the newly added driver is recognized. Jars placed in /usr/lib/spark/jars are picked up when a new Spark application starts, so in JupyterLab restarting the notebook kernel is typically all that's needed. A quick way to confirm the driver is visible is sketched below.
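For example, once the kernel has been restarted, an unofficial way to check from the notebook that the JVM can see the driver class is shown below. This relies on PySpark's internal Py4J gateway (spark._jvm), so treat it as a convenience check rather than a supported API:
# Quick classpath check: raises an error if the driver class is not visible to the Spark driver JVM.
spark._jvm.java.lang.Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
print("SQL Server JDBC driver found on the classpath")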
Additional Tip: Ensure your firewall and network settings allow your Dataproc cluster to communicate with your SQL Server instance. Using the Cloud SQL Auth Proxy could also be beneficial for managing these connections securely.
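If you do try the Cloud SQL Auth Proxy route, here is a minimal sketch of what the read could look like, assuming the proxy is running locally and configured to listen on 127.0.0.1:1433; the host, port, database, table, and credentials are illustrative placeholders taken from the earlier snippet, not values from your actual setup:
# Hypothetical read through a locally running Cloud SQL Auth Proxy.
# Assumes the proxy listens on 127.0.0.1:1433 and the mssql-jdbc driver is on the classpath.
proxy_url = "jdbc:sqlserver://127.0.0.1:1433;databaseName=name;"

df = (spark.read.format("jdbc")
      .option("url", proxy_url)
      .option("dbtable", "table")
      .option("user", "testserver")
      .option("password", "password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())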
Once you have the JDBC driver properly configured and in place, adjust your PySpark script as previously mentioned and try running it again.