Hi, when I try to submit a Python file to a Dataproc batch I get the error below.
The problem seems to be the path created for the file in GCS: gs://rs-nprd-dproc-gdf-cnf-file/dependencies\main.py
=========== Cloud Dataproc Agent Error ===========
java.lang.IllegalArgumentException: Illegal character in path at index 44: gs://rs-nprd-dproc-gdf-cnf-file/dependencies\main.py
at java.base/java.net.URI.create(URI.java:906)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.registerResourceForDownload(AbstractJobHandler.java:628)
at com.google.cloud.hadoop.services.agent.job.handler.PySparkJobHandler.configSparkCommand(PySparkJobHandler.java:191)
at com.google.cloud.hadoop.services.agent.job.handler.PySparkJobHandler.configCommand(PySparkJobHandler.java:82)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.buildCommand(AbstractJobHandler.java:248)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.prepareDriver(AbstractJobHandler.java:888)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.URISyntaxException: Illegal character in path at index 44: gs://rs-nprd-dproc-gdf-cnf-file/dependencies\main.py
at java.base/java.net.URI$Parser.fail(URI.java:2976)
at java.base/java.net.URI$Parser.checkChars(URI.java:3147)
at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3229)
at java.base/java.net.URI$Parser.parse(URI.java:3177)
at java.base/java.net.URI.<init>(URI.java:623)
at java.base/java.net.URI.create(URI.java:904)
... 14 more
======== End of Cloud Dataproc Agent Error ========
It seems like there is an issue with the gcloud dataproc jobs submit command: it is copying the file to an incorrect path in GCS. This appears to be a problem with the command or library, not on your end.
To work around this, you can manually copy the file to GCS with the gsutil cp command, making sure it lands in a valid path. For example, if your file is called main.py and sits in the dependencies folder on your local machine, you can use the following command:
gsutil cp dependencies/main.py gs://my-bucket/main.py
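If you would rather do the copy from Python, here is a minimal sketch using the google-cloud-storage client library (the bucket and file names are the illustrative placeholders from above, not anything from your project):

from google.cloud import storage

# Upload dependencies/main.py to gs://my-bucket/main.py.
# The object name is written with a forward slash; avoid os.path.join
# here, since on Windows it inserts a backslash into the object name.
client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("main.py")
blob.upload_from_filename("dependencies/main.py")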
After copying the file to a valid path, you can use the gcloud dataproc jobs submit command to submit it to your Cloud Dataproc cluster:
gcloud dataproc jobs submit pyspark gs://my-bucket/main.py --cluster=my-cluster
The first command copies the main.py file to the my-bucket bucket in GCS, and the second submits it as a PySpark job to your Cloud Dataproc cluster.
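The submission step can also be done from Python with the google-cloud-dataproc client library. A minimal sketch, assuming a cluster named my-cluster in us-central1 and a project ID of my-project (all placeholders):

from google.cloud import dataproc_v1

# The job controller endpoint is regional.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/main.py"},
}
# Submit the job and block until it completes.
operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": "us-central1", "job": job}
)
response = operation.result()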
The error message indicates that the path to your Python file contains an illegal character: \ (a backslash). In GCS URIs, as on the UNIX-like systems Cloud Dataproc runs on, the forward slash (/) is the directory separator. The backslash causes the Cloud Dataproc agent to fail while parsing the path, so it cannot submit the job.
To fix the error, replace the \ character with a / in the path to your Python file. For instance, change gs://rs-nprd-dproc-gdf-cnf-file/dependencies\main.py to gs://rs-nprd-dproc-gdf-cnf-file/dependencies/main.py.
After correcting the path, you can resubmit the job.
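A backslash like this typically sneaks in when the object path is assembled with os.path.join on a Windows machine. A small Python sketch of the likely cause and a portable fix (the bucket name is taken from your error message; the file names are illustrative):

from pathlib import PureWindowsPath
import posixpath

# On Windows, os.path.join uses "\" as the separator; PureWindowsPath
# reproduces that behavior on any OS for demonstration purposes.
bad = str(PureWindowsPath("dependencies", "main.py"))   # 'dependencies\\main.py'

# Build GCS object names with posixpath.join (or "/".join) instead,
# so the separator is "/" regardless of the operating system.
good = posixpath.join("dependencies", "main.py")        # 'dependencies/main.py'
uri = f"gs://rs-nprd-dproc-gdf-cnf-file/{good}"

# Or normalize an existing path before composing the URI:
fixed = bad.replace("\\", "/")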