Hi,
I want to run a PySpark stored procedure using a custom container image, but when I do, I receive the following error:
‘Job failed. Please check logs at EXECUTION DETAILS from the console or Cloud Logging.’
However, when I check the Execution Details, there is no information displayed.
I’ve set up the connection and the service account with all the required permissions, and the custom Docker image is uploaded to Artifact Registry.
Initially, I received errors saying the image couldn’t be pulled due to insufficient permissions. After fixing that, I only get the same ‘Job failed. Please check logs at EXECUTION DETAILS from the console or Cloud Logging’ message, again with nothing in the Execution Details.
Can someone help me?
Thanks in advance,
This is a common issue when working with custom container images in BigQuery PySpark procedures. Here is a step-by-step approach:
Deep Dive into Logs: Open Cloud Logging directly rather than relying only on the Execution Details pane; the error message itself points there, and the detailed driver output for Spark procedures is often only visible in Cloud Logging.
Verify Custom Container Image: Confirm the image builds and runs locally and contains every dependency your code imports.
Procedure Definition: The container_image URI must use the format:
LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/IMAGE:TAG
Example:
CREATE OR REPLACE PROCEDURE my_dataset.my_pyspark_procedure()
WITH CONNECTION `PROJECT-ID.LOCATION.CONNECTION-ID`
OPTIONS (
  engine='SPARK',
  runtime_version='1.1', -- or your preferred Spark runtime version
  container_image='LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/my-pyspark-image:latest'
)
LANGUAGE PYTHON AS R"""
# Your PySpark code here
"""
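Once the procedure is created, you can test it with a standard CALL:

CALL my_dataset.my_pyspark_procedure();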
Hi, and thanks for the reply!
Here is my Dockerfile, which is a simple example. I need the Slack SDK to run a PySpark stored procedure that sends a Slack notification. Because of the Slack SDK, I need to use a custom image.
Below is a screenshot of the BQ PySpark stored procedure.
As I mentioned, when I check the Execution Details, there is no information displayed.
I have checked that the service account has all the required permissions and that the image is uploaded to Artifact Registry.
If I run a sample Python code without the custom image, it works, so my connection should be fine.
I believe the problem might be with the Docker image. Could you provide an example of a Dockerfile that has all the required configurations along with the Slack SDK?
Thanks!
The Dockerfile below includes the Slack SDK and is set up for use with PySpark and BigQuery.
# Use the official Python 3.9 slim image as a base
FROM python:3.9-slim
# Set the working directory inside the container
WORKDIR /app
# Update package list and install dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends openjdk-11-jre-headless wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install required Python packages
RUN pip install --no-cache-dir \
pyspark \
google-cloud-bigquery \
slack-sdk
# Copy the rest of the application code into the container
COPY . .
# Define the default command to run the container
CMD ["python", "-c", "print('Container is ready to run your Spark application')"]
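For reference, here is a minimal sketch of the kind of script the procedure body could run to send a Slack notification. The SLACK_BOT_TOKEN environment variable and the #my-alerts channel name are assumptions, so adapt them to however you manage secrets:

import os

from pyspark.sql import SparkSession
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

# Reuse (or create) the Spark session provided by the runtime
spark = SparkSession.builder.appName("slack-notification").getOrCreate()

# Hypothetical workload: count rows in a tiny DataFrame so there is something to report
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
row_count = df.count()

# SLACK_BOT_TOKEN and the channel are placeholders; inject them via your secret management
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
try:
    client.chat_postMessage(
        channel="#my-alerts",
        text=f"PySpark job finished, processed {row_count} rows",
    )
except SlackApiError as e:
    # The Slack SDK wraps API failures in SlackApiError with the raw response attached
    print(f"Slack API error: {e.response['error']}")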
Troubleshooting Steps:
1. Build the image locally: docker build -t my-pyspark-slack-image .
2. Run it to confirm the container starts: docker run my-pyspark-slack-image
3. Open an interactive shell inside the container: docker run -it my-pyspark-slack-image /bin/bash
4. From that shell, verify both libraries import cleanly: python -c "import pyspark; from slack_sdk import WebClient; print('PySpark and Slack SDK are working')"
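If those local checks pass, a typical next step is to authenticate Docker against Artifact Registry, then tag and push the image (the registry path is a placeholder):

gcloud auth configure-docker LOCATION-docker.pkg.dev
docker tag my-pyspark-slack-image LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/my-pyspark-image:latest
docker push LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/my-pyspark-image:latest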
Also double-check that the container_image URI is in the correct format: LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/IMAGE:TAG
CREATE OR REPLACE PROCEDURE my_dataset.my_pyspark_procedure()
WITH CONNECTION `PROJECT-ID.LOCATION.CONNECTION-ID`
OPTIONS (
  engine='SPARK',
  runtime_version='1.1', -- or your preferred Spark runtime version
  container_image='LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/my-pyspark-image:latest'
)
LANGUAGE PYTHON AS R"""
# Your PySpark code here
import os
from datetime import datetime, timedelta, date
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

print("Test")
"""
Additional Tips:
- Adjust the CMD in the Dockerfile to provide more verbose logging, which can help identify issues during execution:
CMD ["python", "-c", "print('Container is ready to run your Spark application'); import pyspark; print('PySpark version:', pyspark.__version__); from slack_sdk import WebClient; print('Slack SDK is imported successfully')"]
- Make sure the service account has the roles/artifactregistry.reader and roles/storage.objectViewer permissions.
- In Cloud Logging, filter with resource.type="bigquery_job" and severity="ERROR", and look for any error messages or stack traces that might give more insight into the issue.
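For example, the same filter can be run from the command line with gcloud (PROJECT-ID is a placeholder):

gcloud logging read 'resource.type="bigquery_job" AND severity="ERROR"' --project=PROJECT-ID --limit=20 --freshness=1d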