
Dataproc PySpark Job Fails with "Failed to find data source: pubsub"

I am running a PySpark streaming job on Google Cloud Dataproc to read data from Pub/Sub and write it to BigQuery. However, my job fails with the error:
--------
java.lang.ClassNotFoundException: Failed to find data source: pubsub.
Please find packages at <URL Removed by Staff>
------------

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StringType, StructType, DoubleType

spark = SparkSession.builder \
    .appName("ClickstreamProcessor") \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-streaming-pubsub_2.12:2.4.0") \
    .getOrCreate()

df = spark.readStream.format("pubsub").option("subscription", subscription).load()



Hi @pallakki.

Welcome to Google Cloud Community!

The java.lang.ClassNotFoundException error you encountered suggests that the Google Cloud Pub/Sub connector is not present in the classpath of the Spark application on the Dataproc cluster. The issue with using spark.jars.packages may stem from an incorrect package name or version for the standard Pub/Sub connector, network access restrictions preventing the download, or other configuration issues.

To address this issue, follow these recommended steps:

  1. Check the intended Pub/Sub service: First confirm whether the application should connect to standard Pub/Sub or Pub/Sub Lite, since the two services use different connectors.
  2. Identify the correct Maven coordinates: If using Pub/Sub Lite, use com.google.cloud:pubsublite-spark-sql-streaming:<latest_version> (see the read-side sketch after this list). For standard Pub/Sub, further investigation into the appropriate connector or client library integration might be needed.
  3. Pass the connector coordinates at submission time: Submit the PySpark job with the gcloud dataproc jobs submit pyspark command and supply the Maven coordinates through the spark.jars.packages Spark property via the --properties flag (the gcloud command does not expose a separate --packages flag). Ensure the Dataproc cluster has internet access so the dependency can be downloaded; an example command follows this list.
  4. Consider using the --jars flag: If the spark.jars.packages approach fails (e.g., due to network restrictions), download the connector JAR (the with-dependencies build for Pub/Sub Lite), upload it to a Cloud Storage bucket, and submit the job with the --jars flag pointing to the Cloud Storage path.
  5. Explore creating an Uber JAR: For more complex scenarios or persistent dependency issues, consider building a shaded Uber JAR containing the application and all its dependencies.
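
If Pub/Sub Lite turns out to be the right service, the read side could look roughly like the sketch below once the connector is on the classpath. This is a minimal sketch, assuming the Lite connector's pubsublite data source; the project number, location, and subscription ID are placeholders to replace with your own values.

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Assumes the Pub/Sub Lite connector JAR is already available to the job
# (via spark.jars.packages or --jars, as described in steps 3 and 4).
spark = SparkSession.builder.appName("ClickstreamProcessor").getOrCreate()

df = (
    spark.readStream.format("pubsublite")  # data source registered by the Lite connector
    .option(
        "pubsublite.subscription",
        # Placeholder Lite subscription path:
        # projects/<project-number>/locations/<location>/subscriptions/<subscription-id>
        "projects/123456789/locations/us-central1-a/subscriptions/clickstream-sub",
    )
    .load()
)

# The Lite connector delivers the payload in a binary "data" column; cast it to a
# string before applying from_json as in your original code.
df = df.withColumn("data", df["data"].cast(StringType()))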

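For steps 3 and 4, a submission command could look like the following. This is only a sketch: the script path, cluster name, region, bucket, and connector version are placeholders, so check Maven Central for the current pubsublite-spark-sql-streaming release before using it.

gcloud dataproc jobs submit pyspark gs://<bucket>/clickstream_processor.py \
    --cluster=<cluster-name> \
    --region=<region> \
    --properties=spark.jars.packages=com.google.cloud:pubsublite-spark-sql-streaming:<version>

# If the cluster cannot reach Maven Central, stage the connector JAR in Cloud Storage
# and reference it directly instead of spark.jars.packages:
#     --jars=gs://<bucket>/<pubsublite-connector-with-dependencies>.jar
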
You can also refer to this GitHub repository as a baseline for troubleshooting, along with this blog post, which covers managing and configuring the external libraries and packages your Apache Spark application needs on Google Cloud Dataproc.

For a deeper investigation, you can reach out to Google Cloud Support. When reaching out, include detailed information and relevant screenshots of the errors you’ve encountered. This will assist them in diagnosing and resolving your issue more efficiently.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.