Hi,
I am getting an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o79.jdbc. : java.lang.ClassNotFoundException: mssql-jdbc-12.4.0.jre11.jar
The main file (main.py):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_app').getOrCreate()

connection_string = 'jdbc:sqlserver://1.2.3.4:1433;databaseName=my_db;'
properties = {'user': 'my_user', 'password': 'my_password'}

df = spark.read.jdbc(
    url=connection_string,
    table='my_table',
    properties=properties
)
The gcloud command:
gcloud dataproc batches submit pyspark \
--batch my_batch main.py \
--jars mssql-jdbc-12.4.0.jre11.jar \
--properties driver=mssql-jdbc-12.4.0.jre11.jar
The error message java.lang.ClassNotFoundException: mssql-jdbc-12.4.0.jre11.jar shows that Spark is trying to load a Java class literally named after the JAR file. The MSSQL JDBC driver is not included in the default Dataproc runtime, so you've correctly used the --jars flag to ship the driver JAR with the batch job. The problem is the --properties flag: it sets Spark properties as key=value pairs, and driver=mssql-jdbc-12.4.0.jre11.jar tells Spark to use the JAR file name as a driver class name. If a driver class must be named at all, use the fully qualified class name (com.microsoft.sqlserver.jdbc.SQLServerDriver), not the JAR file name. In practice, with a jdbc:sqlserver:// URL and the JAR on the classpath, Spark can usually resolve the driver on its own, so the property can simply be dropped.
Here's the updated gcloud command:
gcloud dataproc batches submit pyspark \
--batch my_batch main.py \
--jars mssql-jdbc-12.4.0.jre11.jar
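One thing to watch with Dataproc Serverless batches: the JAR must be at a location the service can read. If the local path is not picked up at submit time, staging the driver in Cloud Storage and referencing it by gs:// URI is a reliable alternative. A sketch, where the bucket name gs://my-bucket is illustrative:

```shell
# Stage the driver JAR in a Cloud Storage bucket (bucket name is an example)
gsutil cp mssql-jdbc-12.4.0.jre11.jar gs://my-bucket/jars/

# Reference the staged JAR by its gs:// URI when submitting the batch
gcloud dataproc batches submit pyspark main.py \
    --batch my_batch \
    --jars gs://my-bucket/jars/mssql-jdbc-12.4.0.jre11.jar
```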
If you still encounter issues, explicitly tell Spark which driver class to use. The place for this is the 'driver' key of the JDBC connection properties (there is no spark.driver.class configuration property), and the value must be the fully qualified class name, not the JAR file name. For example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_app').getOrCreate()

connection_string = 'jdbc:sqlserver://1.2.3.4:1433;databaseName=my_db;'
properties = {
    'user': 'my_user',
    'password': 'my_password',
    # Fully qualified driver class name, not the JAR file name
    'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
}

# Read data from MSSQL Server
df = spark.read.jdbc(
    url=connection_string,
    table='my_table',
    properties=properties
)
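As a sanity check, the URL and properties can be built by a small helper so the driver class name is set in one place. This is just an illustrative sketch (the function name and parameters are my own, not part of any API):

```python
def jdbc_config(host, port, database, user, password):
    """Build a SQL Server JDBC URL and connection properties for spark.read.jdbc."""
    url = f'jdbc:sqlserver://{host}:{port};databaseName={database};'
    properties = {
        'user': user,
        'password': password,
        # Fully qualified driver class name, never the JAR file name
        'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
    }
    return url, properties

url, props = jdbc_config('1.2.3.4', 1433, 'my_db', 'my_user', 'my_password')
# url is 'jdbc:sqlserver://1.2.3.4:1433;databaseName=my_db;'
```

The returned pair plugs straight into spark.read.jdbc(url=url, table='my_table', properties=props).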