Running a job in a Dataproc cluster via Livy

I have a Dataproc cluster with Livy installed, and it is working.
What is the command to run a 'hello world' script via Livy?

ACCEPTED SOLUTION

Here's how to run a 'hello world' script via Livy on a Google Cloud Dataproc cluster. Below are step-by-step instructions and examples for both Python and Scala.

Prerequisites

  • A running Google Cloud Dataproc cluster with Livy installed and configured.
  • Basic familiarity with command-line tools (like curl).

Step 1: Create Your 'Hello World' Script

For Python (hello_world.py):

 
print("Hello, world!")

For Scala (HelloWorld.scala):

 
object HelloWorld { 
  def main(args: Array[String]): Unit = { 
    println("Hello, world!") 
  } 
}

Important (Scala): You'll need to compile your Scala code into a JAR file with a build tool such as sbt or Maven. With the sbt-assembly plugin, for example, running sbt assembly produces a self-contained JAR, as in the sketch below.
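
As a rough sketch (the JAR path, project name, and Scala version are assumptions that depend on your build configuration), building and uploading the JAR from a shell might look like this:

 
# Requires the sbt-assembly plugin (addSbtPlugin in project/plugins.sbt)
sbt assembly
# Copy the resulting fat JAR to Cloud Storage so Livy can reach it
gsutil cp target/scala-2.12/helloworld-assembly-0.1.jar gs://your-bucket/hello_world.jar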

Step 2: Submit Your Script to Livy

For Python scripts, submit directly as a batch job:

 
curl -X POST -d '{ "file": "gs://your-bucket/hello_world.py" }' -H "Content-Type: application/json" http://<cluster-name-m>:8998/batches

For Scala (or Java) JARs, include the className:

 
curl -X POST -d '{ "file": "gs://your-bucket/hello_world.jar", "className": "HelloWorld" }' -H "Content-Type: application/json" http://<cluster-name-m>:8998/batches

  • Replace gs://your-bucket/hello_world.py or gs://your-bucket/hello_world.jar with the actual Google Cloud Storage paths to your Python script or Scala JAR.
  • Replace <cluster-name-m> with the hostname of your Dataproc cluster's master node.
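
The submit call responds with a JSON description of the new batch session; its id field is the <batch-id> used in the next steps. An illustrative, abridged response (exact fields vary by Livy version):

 
{"id": 0, "state": "starting", "appId": null, "log": []}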

Step 3: Monitor the Job Status

Query the batch session by the id returned at submission:

 
curl http://<cluster-name-m>:8998/batches/<batch-id>
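
The returned state moves through values such as starting and running before ending in success or dead. A minimal polling sketch (the grep-based extraction is a convenience; a JSON tool such as jq would be more robust):

 
# Poll every 5 seconds until the batch reaches a terminal state
while true; do
  state=$(curl -s http://<cluster-name-m>:8998/batches/<batch-id> | grep -o '"state":"[^"]*"')
  echo "$state"
  case "$state" in *success*|*dead*) break ;; esac
  sleep 5
done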

Step 4: Retrieve the Output

Once the job completes, output from the script (including print statements) goes to the Spark driver logs. For batch jobs submitted through Livy, check the YARN or Spark UI on the Dataproc cluster for the detailed logs and output.
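
Livy also exposes the captured driver log over REST, which is often the quickest way to see the output (note that in YARN cluster deploy mode the print output may land in the YARN container logs instead):

 
curl http://<cluster-name-m>:8998/batches/<batch-id>/log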

Alternative Method: Using gcloud

As an alternative to Livy, you can submit Spark jobs directly to Dataproc using the gcloud command-line tool. For more information, consult the Google Cloud documentation: https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit
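
For example, a direct submission of the same Python script might look like this (the cluster and region names are placeholders):

 
gcloud dataproc jobs submit pyspark gs://your-bucket/hello_world.py \
  --cluster=your-cluster \
  --region=your-region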

Important Considerations

  • Error Handling: If you encounter errors, check the job's status and logs for troubleshooting.
  • Dependency Management: For scripts with dependencies, bundle them into your Scala JAR, or list them in the Livy request body via the pyFiles field for Python or the jars field for Scala/Java (the spark-submit equivalents are --py-files and --jars); see the sketch after this list.
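
As a hedged sketch, extra dependencies can be listed in the Livy request body (helpers.py here is a hypothetical dependency; the paths are placeholders):

 
curl -X POST -d '{ "file": "gs://your-bucket/hello_world.py", "pyFiles": ["gs://your-bucket/helpers.py"] }' -H "Content-Type: application/json" http://<cluster-name-m>:8998/batches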
