Hi,
I am getting an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o79.jdbc. : java.lang.ClassNotFoundException: mssql-jdbc-12.4.0.jre11.jar
The main file (main.py):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_app').getOrCreate()

connection_string = 'jdbc:sqlserver://1.2.3.4:1433;databaseName=my_db;'
properties = {'user': 'my_user', 'password': 'my_password'}

df = spark.read.jdbc(
    url=connection_string,
    table='my_table',
    properties=properties
)
The gcloud command:
gcloud dataproc batches submit pyspark \
--batch my_batch main.py \
--jars mssql-jdbc-12.4.0.jre11.jar \
--properties driver=mssql-jdbc-12.4.0.jre11.jar
The error message java.lang.ClassNotFoundException: mssql-jdbc-12.4.0.jre11.jar shows that Spark is trying to load a Java class literally named after the JAR file. The MSSQL JDBC driver is not included in the default Dataproc runtime, so you've correctly used the --jars flag to ship the driver JAR with the batch job. The problem is the --properties flag: it sets Spark properties as key=value pairs, and driver=mssql-jdbc-12.4.0.jre11.jar tells Spark to use the JAR file name as a driver class name. If a driver class must be named at all, use the fully qualified class name (com.microsoft.sqlserver.jdbc.SQLServerDriver), not the JAR file name. In practice, with a jdbc:sqlserver:// URL and the JAR on the classpath, Spark can usually resolve the driver on its own, so the property can simply be dropped.
Here's the updated gcloud command:
gcloud dataproc batches submit pyspark \
--batch my_batch main.py \
--jars mssql-jdbc-12.4.0.jre11.jar
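One thing to watch with Dataproc Serverless batches: the JAR must be at a location the service can read. If the local path is not picked up at submit time, staging the driver in Cloud Storage and referencing it by gs:// URI is a reliable alternative. A sketch, where the bucket name gs://my-bucket is illustrative:

```shell
# Stage the driver JAR in a Cloud Storage bucket (bucket name is an example)
gsutil cp mssql-jdbc-12.4.0.jre11.jar gs://my-bucket/jars/

# Reference the staged JAR by its gs:// URI when submitting the batch
gcloud dataproc batches submit pyspark main.py \
    --batch my_batch \
    --jars gs://my-bucket/jars/mssql-jdbc-12.4.0.jre11.jar
```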
If you still encounter issues, explicitly tell Spark which driver class to use. The place for this is the 'driver' key of the JDBC connection properties (there is no spark.driver.class configuration property), and the value must be the fully qualified class name, not the JAR file name. For example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_app').getOrCreate()

connection_string = 'jdbc:sqlserver://1.2.3.4:1433;databaseName=my_db;'
properties = {
    'user': 'my_user',
    'password': 'my_password',
    # Fully qualified driver class name, not the JAR file name
    'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
}

# Read data from MSSQL Server
df = spark.read.jdbc(
    url=connection_string,
    table='my_table',
    properties=properties
)
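As a sanity check, the URL and properties can be built by a small helper so the driver class name is set in one place. This is just an illustrative sketch (the function name and parameters are my own, not part of any API):

```python
def jdbc_config(host, port, database, user, password):
    """Build a SQL Server JDBC URL and connection properties for spark.read.jdbc."""
    url = f'jdbc:sqlserver://{host}:{port};databaseName={database};'
    properties = {
        'user': user,
        'password': password,
        # Fully qualified driver class name, never the JAR file name
        'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
    }
    return url, properties

url, props = jdbc_config('1.2.3.4', 1433, 'my_db', 'my_user', 'my_password')
# url is 'jdbc:sqlserver://1.2.3.4:1433;databaseName=my_db;'
```

The returned pair plugs straight into spark.read.jdbc(url=url, table='my_table', properties=props).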