Hi, when I try to submit a Python file to a Dataproc batch I get the error below.
The problem seems to be the path created for the file in GCS: gs://rs-nprd-dproc-gdf-cnf-file/dependencies\main.py
=========== Cloud Dataproc Agent Error ===========
java.lang.IllegalArgumentException: Illegal character in path at index 44: gs://rs-nprd-dproc-gdf-cnf-file/dependencies\main.py
at java.base/java.net.URI.create(URI.java:906)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.registerResourceForDownload(AbstractJobHandler.java:628)
at com.google.cloud.hadoop.services.agent.job.handler.PySparkJobHandler.configSparkCommand(PySparkJobHandler.java:191)
at com.google.cloud.hadoop.services.agent.job.handler.PySparkJobHandler.configCommand(PySparkJobHandler.java:82)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.buildCommand(AbstractJobHandler.java:248)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.prepareDriver(AbstractJobHandler.java:888)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.URISyntaxException: Illegal character in path at index 44: gs://rs-nprd-dproc-gdf-cnf-file/dependencies\main.py
at java.base/java.net.URI$Parser.fail(URI.java:2976)
at java.base/java.net.URI$Parser.checkChars(URI.java:3147)
at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3229)
at java.base/java.net.URI$Parser.parse(URI.java:3177)
at java.base/java.net.URI.<init>(URI.java:623)
at java.base/java.net.URI.create(URI.java:904)
... 14 more
======== End of Cloud Dataproc Agent Error ========
It seems like there is an issue with the gcloud dataproc jobs submit command: it is copying the file to an incorrect path in GCS. This appears to be a problem with the command or library, not on your end.
To work around this, you can manually copy the file to GCS with the gsutil cp command, making sure it lands in a valid path. For example, if your file is called main.py and sits in the dependencies folder on your local machine, you can use the following command:
gsutil cp dependencies/main.py gs://my-bucket/main.py
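If you would rather do the copy from Python, here is a minimal sketch using the google-cloud-storage client library (the bucket and file names are the illustrative placeholders from above, not anything from your project):

from google.cloud import storage

# Upload dependencies/main.py to gs://my-bucket/main.py.
# The object name is written with a forward slash; avoid os.path.join
# here, since on Windows it inserts a backslash into the object name.
client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("main.py")
blob.upload_from_filename("dependencies/main.py")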
After copying the file to a valid path, you can use the gcloud dataproc jobs submit command to submit it to your Cloud Dataproc cluster:
gcloud dataproc jobs submit pyspark gs://my-bucket/main.py --cluster=my-cluster
The first command copies the main.py file to the my-bucket bucket in GCS, and the second submits it as a PySpark job to your Cloud Dataproc cluster.
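The submission step can also be done from Python with the google-cloud-dataproc client library. A minimal sketch, assuming a cluster named my-cluster in us-central1 and a project ID of my-project (all placeholders):

from google.cloud import dataproc_v1

# The job controller endpoint is regional.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/main.py"},
}
# Submit the job and block until it completes.
operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": "us-central1", "job": job}
)
response = operation.result()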
The error message indicates that the path to your Python file contains an illegal character: \ (a backslash). In GCS URIs, as on the UNIX-like systems Cloud Dataproc runs on, the forward slash (/) is the directory separator. The backslash causes the Cloud Dataproc agent to fail while parsing the path, so it cannot submit the job.
To fix the error, replace the \ character with a / in the path to your Python file. For instance, change gs://rs-nprd-dproc-gdf-cnf-file/dependencies\main.py to gs://rs-nprd-dproc-gdf-cnf-file/dependencies/main.py.
After correcting the path, you can resubmit the job.
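A backslash like this typically sneaks in when the object path is assembled with os.path.join on a Windows machine. A small Python sketch of the likely cause and a portable fix (the bucket name is taken from your error message; the file names are illustrative):

from pathlib import PureWindowsPath
import posixpath

# On Windows, os.path.join uses "\" as the separator; PureWindowsPath
# reproduces that behavior on any OS for demonstration purposes.
bad = str(PureWindowsPath("dependencies", "main.py"))   # 'dependencies\\main.py'

# Build GCS object names with posixpath.join (or "/".join) instead,
# so the separator is "/" regardless of the operating system.
good = posixpath.join("dependencies", "main.py")        # 'dependencies/main.py'
uri = f"gs://rs-nprd-dproc-gdf-cnf-file/{good}"

# Or normalize an existing path before composing the URI:
fixed = bad.replace("\\", "/")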