
GCP Databricks and Cloud Pub/Sub

Has anybody used GCP Databricks to process events from Cloud Pub/Sub? I'm wondering whether I need to create a service account and assign it the Pub/Sub Subscriber role in order to subscribe to the messages. Any guidance on this would be much appreciated.

Thanks

Thanks for your question, Shiva.

Authentication for Pub/Sub in a Databricks deployment on GCP should work much like it does for the other supported GCP products and services. You may find this guide for connecting to BigQuery from GCP Databricks helpful.
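
To answer the service-account part of your question directly: yes, the identity your cluster runs as needs roles/pubsub.subscriber on the subscription. Here is a minimal sketch you can run from a notebook to confirm the credentials can actually pull messages; it assumes the google-cloud-pubsub client library is installed on the cluster, and the project and subscription IDs are placeholders.

    # Minimal sketch: verify that the credentials your Databricks cluster uses
    # can pull from a Pub/Sub subscription. Assumes google-cloud-pubsub is
    # installed and the service account holds roles/pubsub.subscriber.
    # Project and subscription IDs below are placeholders.
    from google.cloud import pubsub_v1

    project_id = "my-gcp-project"        # placeholder
    subscription_id = "my-subscription"  # placeholder

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    # Synchronous pull of a few messages; this raises PermissionDenied if the
    # subscriber role is missing.
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 5}
    )
    for msg in response.received_messages:
        print(msg.message.data)

    # Acknowledge what we read so it is not redelivered.
    if response.received_messages:
        ack_ids = [m.ack_id for m in response.received_messages]
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )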

As for reading from Pub/Sub in Databricks, a couple of questions for you:

  • Which Spark version are you running?
  • Are you trying Spark Structured Streaming (with Spark DataFrames and Datasets) or Spark Streaming (the DStream API over Spark RDDs) with Pub/Sub?
  • Which of PySpark, Java, Scala, SQL, or R are you using?

Hi Tianzi,

Thanks for your response. I'm using Spark 3.1.2 and PySpark.

Does Pub/Sub support Structured Streaming? If not, I'll have to use DStreams.

Thanks for your reply, Shiva! Structured Streaming support for Pub/Sub isn't there yet, but I gave this OSS Spark Pub/Sub connector a try, and it still works if you use my fork and follow the README there. My PySpark job, submitted to a Dataproc cluster (version 1.5, with project access set to allow API access to all or select GCP services), ran successfully. Are you submitting your Spark jobs to Dataproc too?

Be sure to:

  • use Spark 2.4.x
  • use Python ≤3.7.12 when you build the connector .egg file
  • set the `decodeData` option to True when you use the PubsubUtils class to create a stream (see the sketch after this list)
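
For reference, here is roughly what the DStream side of that job looks like. This is only a sketch: the PubsubUtils class and the `decodeData` option come from the connector's README, but the module name and the exact createStream parameter list shown here are assumptions, so check the fork's README for the real signature. Project and subscription names are placeholders.

    # Minimal sketch of reading Pub/Sub with the OSS connector's DStream API
    # on Spark 2.4.x. The PubsubUtils import path and createStream signature
    # are assumptions; follow the fork's README for the actual API.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pubsub import PubsubUtils  # module name per the connector .egg (assumption)

    sc = SparkContext(appName="pubsub-dstream-demo")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # decodeData=True tells the connector to return decoded message payloads
    stream = PubsubUtils.createStream(
        ssc,
        "projects/my-gcp-project/subscriptions/my-subscription",  # placeholder
        batchSize=1000,  # assumption: max messages pulled per batch
        decodeData=True,
    )

    stream.pprint()  # print a sample of each micro-batch

    ssc.start()
    ssc.awaitTermination()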

I also wanted to point you to Spark Structured Streaming support for Pub/Sub Lite. This Medium article describes how to use Pub/Sub Lite as a source with Spark Structured Streaming on Databricks.
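
If you can use Pub/Sub Lite instead, the read side in PySpark looks roughly like the sketch below. It assumes the Pub/Sub Lite Spark connector (the pubsublite-spark-sql-streaming jar) is attached to your cluster; the project number, zone, and subscription ID are placeholders.

    # Minimal sketch of Pub/Sub Lite as a Structured Streaming source.
    # Assumes the pubsublite-spark-sql-streaming connector jar is attached
    # to the cluster; the subscription path below is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pubsublite-demo").getOrCreate()

    df = (
        spark.readStream
        .format("pubsublite")
        .option(
            "pubsublite.subscription",
            "projects/123456789/locations/us-central1-a/subscriptions/my-lite-subscription",
        )
        .load()
    )

    # The message payload arrives as bytes; cast it to a string for display.
    query = (
        df.withColumn("data", df.data.cast("string"))
        .writeStream
        .format("console")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()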