I am running a PySpark streaming job on Google Cloud Dataproc to read data from Pub/Sub and write it to BigQuery. However, my job fails with the error:
--------
java.lang.ClassNotFoundException: Failed to find data source: pubsub.
Please find packages at <URL Removed by Staff>
--------
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StringType, StructType, DoubleType

spark = SparkSession.builder \
    .appName("ClickstreamProcessor") \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-streaming-pubsub_2.12:2.4.0") \
    .getOrCreate()

df = spark.readStream.format("pubsub").option("subscription", subscription).load()
Hi @pallakki.
Welcome to Google Cloud Community!
The java.lang.ClassNotFoundException error you encountered means Spark could not find a data source registered under the name pubsub anywhere on the application's classpath, i.e. the Pub/Sub connector never made it onto the Dataproc cluster. With spark.jars.packages, this usually comes down to one of three things: the Maven coordinates are wrong (the artifact you listed does not appear to be a published Structured Streaming connector), network restrictions are preventing the cluster from downloading the package, or another configuration issue is keeping the property from taking effect.
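As a quick sanity check, here is a minimal diagnostic sketch that reuses the spark session from your snippet (it is not a Dataproc-specific tool). It prints the package configuration the driver actually received; if spark.jars.packages comes back unset or different from what you configured, the dependency was never requested:

# Diagnostic sketch: show what the running driver actually resolved.
# "<not set>" means the property never reached the Spark session.
print(spark.conf.get("spark.jars.packages", "<not set>"))
print(spark.conf.get("spark.jars", "<not set>"))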
To address this issue, follow these recommended steps:

1. Verify the connector coordinates. Standard Pub/Sub has no official Structured Streaming connector under the name pubsub; Google's supported option is the Pub/Sub Lite Spark connector, which registers the pubsublite data source (see the sketch after this list).
2. Supply the dependency at submit time, e.g. via --properties spark.jars.packages=... on gcloud dataproc jobs submit pyspark, so it is resolved before the driver JVM starts; setting it inside the SparkSession builder is often too late to trigger dependency resolution.
3. Check network access. Internal-IP-only Dataproc clusters need Cloud NAT or a proxy to reach Maven Central, and a blocked download surfaces later as exactly this kind of missing class.
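If switching the source to Pub/Sub Lite is an option, here is a minimal sketch of what the read side could look like, assuming the Pub/Sub Lite Spark connector; the project number, location, subscription ID, and connector version are placeholders you would need to replace:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ClickstreamProcessor")
    # Pub/Sub Lite Spark connector coordinates; pin a version that matches
    # your cluster's Spark/Scala build (1.0.0 here is a placeholder).
    .config("spark.jars.packages",
            "com.google.cloud:pubsublite-spark-sql-streaming:1.0.0")
    .getOrCreate()
)

# Full Pub/Sub Lite subscription path (all segments are placeholders).
subscription = (
    "projects/PROJECT_NUMBER/locations/REGION-ZONE/subscriptions/SUBSCRIPTION_ID"
)

df = (
    spark.readStream
    .format("pubsublite")  # the data source name this connector registers
    .option("pubsublite.subscription", subscription)
    .load()
)

Note that Pub/Sub Lite is a separate product from standard Pub/Sub, so this only applies if your data lives in (or can be mirrored to) a Pub/Sub Lite topic and subscription.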
In addition, you can refer to this GitHub repository as a baseline for troubleshooting, along with this blog, which covers managing and configuring the external libraries and packages your Apache Spark application needs on Google Cloud Dataproc.
For a deeper investigation, you can reach out to Google Cloud Support. When reaching out, include detailed information and relevant screenshots of the errors you’ve encountered. This will assist them in diagnosing and resolving your issue more efficiently.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.