
PubSub to Dataproc Serverless

Is it possible to read events from GCP Pub/Sub (not Lite) using Spark (Structured Streaming) in Dataproc (Serverless)? 

So far, I have only found articles that show how to read from GCP Pub/Sub Lite. If this is not possible on GCP with Dataproc, would you recommend using Databricks instead?


Reading data from standard Google Cloud Pub/Sub using Spark Structured Streaming in Dataproc (Serverless) is not natively supported due to the absence of a direct connector. While a connector exists for Pub/Sub Lite, this service was deprecated on September 24, 2024, for new customers and will be fully shut down by March 18, 2026. Existing customers who had not used Pub/Sub Lite in the 90 days prior to its deprecation also lost access on that date. Google recommends transitioning to standard Pub/Sub or Managed Service for Apache Kafka as alternatives.

The Pub/Sub Lite Spark connector, an open-source Java library, enables Spark Structured Streaming to work with Pub/Sub Lite as both an input and output source. This connector is compatible with platforms like Dataproc and Databricks. However, due to limitations in Spark’s processing model, it does not support seeking to the beginning of a backlog. Instead, it allows seeking from a specific Unix epoch timestamp to replay all messages.
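As a rough illustration of how the Pub/Sub Lite Spark connector is wired up, here is a minimal sketch. It assumes the `pubsublite-spark-sql-streaming` connector jar is on the Spark classpath (e.g. passed via `--packages` or the Dataproc Serverless `spark.jars.packages` property), and the project number, location, and subscription ID below are placeholders:

```python
# Hedged sketch: reading a Pub/Sub Lite subscription as a Spark
# Structured Streaming source. Assumes the Pub/Sub Lite Spark connector
# jar is available to the cluster; all resource names are illustrative.

def lite_subscription_path(project_number: str, location: str,
                           subscription_id: str) -> str:
    """Build the fully qualified Pub/Sub Lite subscription path."""
    return (f"projects/{project_number}/locations/{location}"
            f"/subscriptions/{subscription_id}")

def read_lite_stream(spark, subscription_path: str):
    """Return a streaming DataFrame backed by a Pub/Sub Lite subscription.

    `spark` is an existing SparkSession created with the connector jar
    on its classpath.
    """
    return (spark.readStream
            .format("pubsublite")  # format name registered by the connector
            .option("pubsublite.subscription", subscription_path)
            .load())
```

In a job you would pass the resulting DataFrame to `writeStream` as usual; replaying from a point in time is then handled by seeking the Lite subscription itself (e.g. with `gcloud pubsub lite-subscriptions seek`), not through Spark options.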

For standard Pub/Sub, a practical alternative is Apache Beam, which provides robust IO capabilities for Pub/Sub in both Java and Python. Beam pipelines can run on Spark runners in Dataproc or on Google Cloud Dataflow, which offers additional features for seamless integration with Pub/Sub.
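To make the Beam alternative concrete, here is a minimal streaming pipeline sketch in Python that reads from a standard Pub/Sub subscription. The project and subscription names are placeholders; the same pipeline can be submitted to the Spark runner (`--runner=SparkRunner`) or to Dataflow (`--runner=DataflowRunner`):

```python
# Hedged sketch: an Apache Beam streaming pipeline consuming standard
# Pub/Sub messages. Resource names are illustrative.

def pubsub_subscription_path(project_id: str, subscription_id: str) -> str:
    """Build the fully qualified Pub/Sub subscription path."""
    return f"projects/{project_id}/subscriptions/{subscription_id}"

def run(subscription_path: str) -> None:
    # Imported inside the function so this module loads even where
    # apache-beam is not installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)  # Pub/Sub reads are unbounded
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
               subscription=subscription_path)
         | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
         | "Log" >> beam.Map(print))  # replace with your real sink
```

For example, `run(pubsub_subscription_path("my-project", "my-sub"))` would start consuming from `projects/my-project/subscriptions/my-sub` on whichever runner is configured.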