How do I submit a PySpark job on Dataproc Serverless?
I need to submit not just a single Python file, but an entire Python project. In addition to main.py, I need to include other files like config.json, requirements.txt, and additional Python files that main.py references and imports.
For example, I have this project structure, where main.py imports helper, logger, etc., and uses config.json for initial configurations. Also, I need the packages in requirements.txt to be installed.
In short, the job needs to run main.py, but the entire project must be available for it to execute on Dataproc Serverless.
project/
├── main.py
├── file1.py
├── file2.py
├── config.json
├── requirements.txt
│
├── utils/
│ ├── helper.py
│ └── logger.py
│
├── services/
│ ├── service1.py
│ └── service2.py
Hi @DanteQP,
Welcome to Google Cloud Community!
You're on the right track in wanting to package your entire project for Dataproc Serverless. The key is to package it correctly so that all of your files and dependencies are available to the batch at runtime.
Here’s the breakdown:
Package Your Project (.zip):
Create a .zip archive of your project so everything travels together. Zip the contents of the project folder (rather than the folder itself) so that utils/ and services/ sit at the root of the archive and main.py's imports resolve:

cd project && zip -r ../project.zip . && cd ..
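If you want to sanity-check the layout before submitting (optional; this just inspects the archive created above), list its contents and confirm that utils/ and services/ appear at the top level:

unzip -l project.zip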
Submit the Batch (--py-files):
Use the --py-files option to include your zipped project. Because config.json is read from the working directory at runtime, pass it with --files as well. The --deps-bucket flag names a Cloud Storage bucket that gcloud uses to stage these local files before submitting:

gcloud dataproc batches submit pyspark project/main.py \
    --region=<your-region> \
    --project=<your-project-id> \
    --py-files=project.zip \
    --files=project/config.json \
    --deps-bucket=<your-gcs-bucket>
Manage Dependencies (requirements.txt):
There are a couple of ways to handle the packages in requirements.txt:

Option 1: Check the runtime, then bundle what's missing. The Dataproc Serverless runtime already includes many common libraries (e.g., numpy and pandas), so start by checking the runtime's package list. Pure-Python packages that aren't included can be pip-installed into a local folder, zipped, and added to --py-files (see the sketch after this list).

Option 2: Use a custom container image. Dataproc Serverless batches don't support initialization actions (those are a feature of Dataproc clusters on Compute Engine), so if you need everything in requirements.txt installed, build a container image that runs pip install -r requirements.txt and pass it to the batch with --container-image.
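Here is a minimal sketch of Option 1. It assumes the missing requirements are pure-Python packages and uses a hypothetical deps/ staging folder:

# Install the requirements into a local folder and zip them up
pip install -r project/requirements.txt -t deps/
cd deps && zip -r ../deps.zip . && cd ..
# Then include both archives when submitting:
#   --py-files=project.zip,deps.zip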
Access Your Files in main.py:
Because the zipped project is added to the Python path and config.json is shipped via --files, main.py can import your modules and read the config as if everything were in the same directory:

import json

from utils.helper import some_helper_function  # import from other Python files in the zip
from services.service1 import Service1

with open("config.json", "r") as f:  # config.json lands in the working directory via --files
    config = json.load(f)

service = Service1(config)  # use your config
some_helper_function()      # call functions from other files

# Your main Spark logic here...
By zipping your project, passing it with --py-files, and handling requirements.txt with one of the options above, all of the files and dependencies your job needs are available during execution on Dataproc Serverless.
I hope this helps.