
PySpark job execution error in Dataproc: "Correlated column is not allowed in predicate"

Dataproc is throwing errors when running PySpark code that was developed on a lower PySpark version than the one Dataproc ships. How do we handle these kinds of issues? Is there any backward-compatibility plugin to get lower-version code executed on Dataproc? Our on-premises PySpark version is 2.4.7, and the PySpark version on GCP Dataproc is 3.1.3. Let us know if you have seen any backward-compatibility issues while running PySpark jobs on Dataproc in GCP.
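For illustration, here is a minimal sketch of the kind of query pattern that can raise this error on Spark 3.1 but ran on 2.4 (the table and column names are made up; the relevant part is the non-equality correlated predicate inside an aggregated scalar subquery, a case Spark's analyzer began rejecting around 3.1.2):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("correlated-subquery").getOrCreate()

# Hypothetical tables, just to make the sketch self-contained.
spark.createDataFrame([(1, 10), (2, 20)], ["id", "ts"]) \
    .createOrReplaceTempView("orders")
spark.createDataFrame([(1, 5), (1, 15), (2, 25)], ["id", "ts"]) \
    .createOrReplaceTempView("events")

# An aggregated correlated scalar subquery whose correlation predicate is a
# non-equality comparison (e.ts < o.ts). Spark 2.4 accepted queries like
# this; Spark 3.1.2+ rejects them at analysis time with
# "Correlated column is not allowed in a non-equality predicate".
spark.sql("""
    SELECT o.id,
           (SELECT COUNT(*) FROM events e WHERE e.ts < o.ts) AS prior_events
    FROM orders o
""").show()
```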


When running PySpark jobs in Google Cloud Dataproc, the job code must be compatible at runtime with the Python interpreter version and the dependencies installed on the cluster (https://cloud.google.com/dataproc/docs/tutorials/python-configuration). Unfortunately, there is no backward-compatibility plugin provided by Dataproc or GCP that lets you execute lower-version code directly on a higher-version cluster.
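As a first step, it helps to confirm what the cluster actually runs. A minimal sketch using only standard PySpark (submit it like any other PySpark job to see the cluster's versions):

```python
import sys

from pyspark.sql import SparkSession

# Print the Spark and Python versions the job actually runs against.
spark = SparkSession.builder.appName("version-check").getOrCreate()
print("Spark version: ", spark.version)
print("Python version:", sys.version)
spark.stop()
```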

One way to manage version dependencies is to use Conda environments. Conda is an open-source package and environment management system, and it can be used to create a Python environment with a specific version of PySpark that matches your code (https://spark.apache.org/docs/latest/api/python/getting_started/install.html).
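For example, a Conda spec pinning PySpark 2.4.7 might look like the sketch below. The file name and the Python 3.7 pin are assumptions (PySpark 2.4.x does not support Python 3.8+), and the exact patch release is pulled from PyPI via pip since Conda channels may not carry it:

```yaml
# environment.yaml -- hypothetical spec for a PySpark 2.4.7 environment.
name: pyspark247
dependencies:
  - python=3.7          # PySpark 2.4.x does not run on Python 3.8+
  - pip
  - pip:
      - pyspark==2.4.7  # exact patch release from PyPI
```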

However, please note that after a cluster is created, there is no good way to reconfigure the Python environment on all of its workers. If you want a Python environment with a specific version of PySpark, it's best to set it up when creating a new cluster (see the sketch below).
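If you go the Conda route, the python-configuration tutorial linked above describes a cluster property that points Dataproc at a Conda spec in Cloud Storage at cluster-creation time. A sketch, assuming the environment.yaml above and placeholder bucket, cluster, and region names:

```sh
# Stage the hypothetical spec, then create the cluster with it.
# (Property name per the python-configuration tutorial linked above.)
gsutil cp environment.yaml gs://my-bucket/environment.yaml

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties='dataproc:conda.env.config.uri=gs://my-bucket/environment.yaml'
```

Also worth noting: the Spark runtime itself is fixed by the cluster's image version, so if the goal is to run Spark 2.4-era code unchanged, creating the cluster with an older image (for example, a Dataproc 1.5 image, which ships Spark 2.4.x) may be a more direct option than repackaging PySpark in Conda.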