Google Cloud Platform - Vertex AI - Workbench Jupy...

cs4225 · 04-13-2022 03:12 PM

Hi All,

I am trying to connect to a SparkSession on Vertex AI's Workbench JupyterLab, but receive this error. Locally, my JAVA_HOME system environments and path environments are already set, and can work when I run Jupyter locally. But only on Vertex AI's Workbench JupyterLab I get this error.

Code:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('Jupyter BigQuery Storage')\
.config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
.getOrCreate()

Full Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_3404/1949393828.py in <module>
      9 spark = SparkSession.builder \
     10   .appName('Jupyter BigQuery Storage')\
---> 11   .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
     12   .getOrCreate()
     13 

/opt/conda/lib/python3.7/site-packages/pyspark/sql/session.py in getOrCreate(self)
    226                             sparkConf.set(key, value)
    227                         # This SparkContext may be an existing one.
--> 228                         sc = SparkContext.getOrCreate(sparkConf)
    229                     # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    230                     # by all sessions.

/opt/conda/lib/python3.7/site-packages/pyspark/context.py in getOrCreate(cls, conf)
    390         with SparkContext._lock:
    391             if SparkContext._active_spark_context is None:
--> 392                 SparkContext(conf=conf or SparkConf())
    393             return SparkContext._active_spark_context
    394 

/opt/conda/lib/python3.7/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                 " is not allowed as it is a security risk.")
    143 
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/opt/conda/lib/python3.7/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    337         with SparkContext._lock:
    338             if not SparkContext._gateway:
--> 339                 SparkContext._gateway = gateway or launch_gateway(conf)
    340                 SparkContext._jvm = SparkContext._gateway.jvm
    341 

/opt/conda/lib/python3.7/site-packages/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise RuntimeError("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

RuntimeError: Java gateway process exited before sending its port number

Do let me know if you have advice or help, thank you!

raul_saucedo

You would need to have Java installed on your Mac, Linux or Windows, without Java installation & not having JAVA_HOME environment variable set with Java installation path or not having PYSPARK_SUBMIT_ARGS, you would get this Exception.

You need to Set PYSPARK_SUBMIT_ARGS with master, this resolves Exception: Java gateway process exited before sending the driver its port number.

export PYSPARK_SUBMIT_ARGS="--master local[3] pyspark-shell"

vi ~/.bashrc , add the above line and reload the bashrc file using source ~/.bashrc

In case the issue is still not resolved, check your Java installation and JAVA_HOME environment variable.

You can see this troubleshooting documentation[1].

[1] https://cloud.google.com/vertex-ai/docs/general/troubleshooting

Google Cloud Platform - Vertex AI - Workbench JupyterLab - Spark/Hadoop - JAVA_HOME is not set error