The PySpark jobs I have developed run fine in my local Spark environment (developer setup), but when running on Dataproc they fail with the error below:
"Failed to load PySpark version file for packaging. You must be in Spark's python dir.
"
There seems to be nothing wrong with the cluster itself; I am able to submit other jobs. My guess is that the issue is because the job I am running is nested two packages deep, but I want to check whether that is the cause or something else is going on.
So this was a really silly issue. File paths work differently in GCS: it has a flat object namespace, so navigating it as if it were a proper hierarchical file system causes the error above. You need to use the Google Cloud Storage client to read file contents, even when running from Dataproc.
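For anyone hitting the same thing, here is a minimal sketch of reading an object's contents with the google-cloud-storage client instead of treating the gs:// path like a local directory. The bucket and object names are hypothetical placeholders:

```python
# Minimal sketch: read a GCS object's contents with the Cloud Storage client
# rather than opening it as a local filesystem path.
from google.cloud import storage


def read_gcs_text(bucket_name: str, blob_name: str) -> str:
    """Download a GCS object as text (e.g. a small config or lookup file)."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    return blob.download_as_text()


# Hypothetical usage inside a PySpark job running on Dataproc:
# contents = read_gcs_text("my-bucket", "jobs/configs/settings.json")
```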