How do I submit a PySpark job on Dataproc Serverless?
I need to submit not just a single Python file, but an entire Python project. In addition to main.py, I need to include other files like config.json, requirements.txt, and additional Python files that main.py references and imports.
For example, I have this project structure, where main.py imports helper, logger, etc., and uses config.json for initial configurations. Also, I need the packages in requirements.txt to be installed.
In short, the job needs to run main.py, but the entire project must be available for it to execute on Dataproc Serverless.
project/
├── main.py
├── file1.py
├── file2.py
├── config.json
├── requirements.txt
│
├── utils/
│ ├── helper.py
│ └── logger.py
│
├── services/
│ ├── service1.py
│ └── service2.py
Hi @DanteQP,
Welcome to Google Cloud Community!
You're on the right track in wanting to package your entire project for Dataproc Serverless. The key is to package it correctly so that all of your files and dependencies are available to the batch at runtime.
Here’s the breakdown:
Package Your Project (.zip):
Create a .zip archive of your project so everything travels together. Zip the contents of the project folder (rather than the folder itself) so that utils/ and services/ sit at the root of the archive and main.py's imports resolve:

cd project && zip -r ../project.zip . && cd ..
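If you want to sanity-check the layout before submitting (optional; this just inspects the archive created above), list its contents and confirm that utils/ and services/ appear at the top level:

unzip -l project.zip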
Submit the Batch (--py-files):
Use the --py-files option to include your zipped project. Because config.json is read from the working directory at runtime, pass it with --files as well. The --deps-bucket flag names a Cloud Storage bucket that gcloud uses to stage these local files before submitting:

gcloud dataproc batches submit pyspark project/main.py \
    --region=<your-region> \
    --project=<your-project-id> \
    --py-files=project.zip \
    --files=project/config.json \
    --deps-bucket=<your-gcs-bucket>
Manage Dependencies (requirements.txt):
There are a couple of ways to handle the packages in requirements.txt:

Option 1: Check the runtime, then bundle what's missing. The Dataproc Serverless runtime already includes many common libraries (e.g., numpy and pandas), so start by checking the runtime's package list. Pure-Python packages that aren't included can be pip-installed into a local folder, zipped, and added to --py-files (see the sketch after this list).

Option 2: Use a custom container image. Dataproc Serverless batches don't support initialization actions (those are a feature of Dataproc clusters on Compute Engine), so if you need everything in requirements.txt installed, build a container image that runs pip install -r requirements.txt and pass it to the batch with --container-image.
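Here is a minimal sketch of Option 1. It assumes the missing requirements are pure-Python packages and uses a hypothetical deps/ staging folder:

# Install the requirements into a local folder and zip them up
pip install -r project/requirements.txt -t deps/
cd deps && zip -r ../deps.zip . && cd ..
# Then include both archives when submitting:
#   --py-files=project.zip,deps.zip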
Access Your Files in main.py:
Because the zipped project is added to the Python path and config.json is shipped via --files, main.py can import your modules and read the config as if everything were in the same directory:

import json

from utils.helper import some_helper_function  # import from other Python files in the zip
from services.service1 import Service1

with open("config.json", "r") as f:  # config.json lands in the working directory via --files
    config = json.load(f)

service = Service1(config)  # use your config
some_helper_function()      # call functions from other files

# Your main Spark logic here...
By zipping your project, passing it with --py-files, and handling requirements.txt with one of the options above, all of the files and dependencies your job needs are available during execution on Dataproc Serverless.
I hope this helps.