
BigQuery PySpark procedure with Custom container image

 

Hi,

I want to run a PySpark stored procedure using a custom container image, but when I do, I receive the following error:

Job failed. Please check logs at EXECUTION DETAILS from the console or Cloud Logging.

However, when I check the Execution Details, there is no information displayed.

I’ve set up the connection and service account with all the required permissions. The custom Docker image is uploaded to the Artifact Registry.

Initially, I received errors indicating that the user couldn’t pull the image due to insufficient permissions, but after fixing that, no further information regarding the error is displayed. I only get the following message: ‘Job failed. Please check logs at EXECUTION DETAILS from the console or Cloud Logging,’ with no information in the Execution Details.

Can someone help me?

 

Thanks in advance,


This is a common issue when working with custom container images in BigQuery PySpark stored procedures. Here is a step-by-step approach:

Deep Dive into Logs:

  • Cloud Logging: Navigate to the Cloud Logging console and filter for logs related to your BigQuery job. Look for error messages that provide more context than the generic "Job failed" message (a programmatic sketch follows this list).
  • Error Reporting: Check Google Cloud Error Reporting for any error events associated with your BigQuery job.
  • Execution Details (Advanced): Even though Execution Details might seem empty, sometimes clicking on the "View Logs" button (if available) or inspecting the job configuration can reveal hidden error messages.
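
If you prefer to pull these entries programmatically, here is a minimal sketch using the google-cloud-logging Python client; the project ID, resource type, and timestamp filter are assumptions you should adjust for your environment:

# Hypothetical sketch: list recent ERROR-level BigQuery log entries.
# "PROJECT-ID" and the resource.type / timestamp filters are placeholders; adjust them for your setup.
from google.cloud import logging

client = logging.Client(project="PROJECT-ID")
log_filter = (
    'resource.type="bigquery_resource" '
    'severity>=ERROR '
    'timestamp>="2024-05-30T00:00:00Z"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.severity, entry.payload)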

Verify Custom Container Image:

  • Permissions: Double-check that your service account (used by BigQuery) has the following permissions on the Artifact Registry:
    • Artifact Registry Reader
    • Storage Object Viewer (if using Google Cloud Storage for image storage)
  • Image Content:
    • Base Image: Ensure your base image is compatible with the BigQuery PySpark environment. The official Python images (e.g., python:3.9-slim) are a good starting point.
    • Dependencies: Confirm that all required Python packages and libraries are installed within your custom image (a quick import check is sketched after this list).
    • Entry Point: Make sure the entry point script (the one you specify in your procedure definition) is correctly set up to execute your PySpark code.
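
For the Dependencies check, a small script like the hypothetical check_deps.py below can be copied into the image and run with docker run to confirm that each package imports cleanly (extend the list with whatever your procedure actually needs):

# check_deps.py - hypothetical sanity check; run it inside the custom image, e.g.:
#   docker run IMAGE python /app/check_deps.py
import importlib

# Adjust this list to match the imports your PySpark procedure uses.
REQUIRED_MODULES = ["pyspark", "google.cloud.bigquery"]

for module_name in REQUIRED_MODULES:
    try:
        module = importlib.import_module(module_name)
        print(f"OK      {module_name} ({getattr(module, '__version__', 'unknown')})")
    except ImportError as exc:
        print(f"MISSING {module_name}: {exc}")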

Procedure Definition:

  • Image URI: Verify that the image URI in your procedure definition is accurate and points to the correct image in the Artifact Registry. It should be in the following format:
 
LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/IMAGE:TAG
  • Resource Allocation: If your PySpark job is resource-intensive, consider increasing the job's memory and CPU allocation in the procedure definition.

Additional Tips:

  • Test Locally: Try running your PySpark code locally with the same custom container image to isolate any potential issues with the image itself (a minimal smoke test is sketched after this list).
  • Simplify: Start with a minimal custom image containing only the bare essentials. Gradually add more dependencies as needed to pinpoint any problematic components.
  • Community: Reach out to the Google Cloud Community forums. There's a good chance someone has encountered a similar issue and can offer insights.
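
As a sketch of the Test Locally tip, a minimal smoke-test job along the lines of the hypothetical local_test.py below can be run inside the container; note that running Spark locally also requires a Java runtime in the image:

# local_test.py - hypothetical smoke test; run inside the container, e.g.:
#   docker run IMAGE python /app/local_test.py
from pyspark.sql import SparkSession

# Start a local Spark session (needs a JRE inside the image).
spark = SparkSession.builder.master("local[2]").appName("image-smoke-test").getOrCreate()

# Tiny DataFrame plus a trivial action to confirm executors actually run.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
print("Row count:", df.count())
df.show()

spark.stop()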

Example:

 
CREATE OR REPLACE PROCEDURE my_dataset.my_pyspark_procedure()
WITH CONNECTION `PROJECT-ID.LOCATION.CONNECTION-ID`
OPTIONS(
  engine='SPARK',
  runtime_version='1.1',  -- Or another supported Spark runtime version
  container_image='LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/my-pyspark-image:latest'
)
LANGUAGE PYTHON AS R"""
# Your PySpark code here
""";
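
To invoke the procedure and capture a job ID you can then look up in Cloud Logging, here is a hedged sketch using the google-cloud-bigquery Python client (project, dataset, and procedure names are placeholders):

# Hypothetical: call the procedure from Python and wait for the Spark job to finish.
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT-ID")
job = client.query("CALL my_dataset.my_pyspark_procedure();")
job.result()  # blocks until the procedure completes; raises on failure
print("Job ID:", job.job_id)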

Hi, and thanks for the reply!

Here is my Dockerfile, which is a simple example. I need the Slack SDK to run a PySpark stored procedure that sends a Slack notification. Because of the Slack SDK, I need to use a custom image.

[Screenshot: Dockerfile (rnedelcu_4-1717071400112.png)]


Below is a screenshot from the BQ PySpark stored procedure.

[Screenshots: rnedelcu_5-1717071423114.png, rnedelcu_6-1717071440642.png]

As I mentioned, when I check the Execution Details, there is no information displayed.

I have checked, and the service account has all the required permissions. The image is uploaded to Artifact Registry.

If I run a sample Python code without the custom image, it works, so my connection should be fine.

I believe the problem might be with the Docker image. Could you provide an example of a Dockerfile that has all the required configurations along with the Slack SDK?


Thanks!

The Dockerfile below includes the Slack SDK and is set up for use with PySpark and BigQuery.

 
# Use the official Python 3.9 slim image as a base
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Update package list and install dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-11-jre-headless wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Install required Python packages
RUN pip install --no-cache-dir \
    pyspark \
    google-cloud-bigquery \
    slack-sdk

# Copy the rest of the application code into the container
COPY . .

# Define the default command to run the container
CMD ["python", "-c", "print('Container is ready to run your Spark application')"]

Troubleshooting Steps:

  • Verify the Docker Image: Build and run the Docker image locally to ensure it works as expected.
docker build -t my-pyspark-slack-image .
docker run my-pyspark-slack-image
  • Test Locally with PySpark: Run a simple PySpark job within the container to verify that PySpark and the Slack SDK work correctly together.
 
docker run -it my-pyspark-slack-image /bin/bash
python -c "import pyspark; from slack_sdk import WebClient; print('PySpark', pyspark.__version__, 'and the Slack SDK are importable')"
  • Check BigQuery Job Configuration: Ensure the BigQuery PySpark job is configured correctly to use the custom container image. The container_image URI should be in the correct format:
LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/IMAGE:TAG
  • Example Procedure Definition:
CREATE OR REPLACE PROCEDURE my_dataset.my_pyspark_procedure()
WITH CONNECTION `PROJECT-ID.LOCATION.CONNECTION-ID`
OPTIONS(
  engine='SPARK',
  runtime_version='1.1',  -- Or another supported Spark runtime version
  container_image='LOCATION-docker.pkg.dev/PROJECT-ID/REPOSITORY/my-pyspark-image:latest'
)
LANGUAGE PYTHON AS R"""
import os
from datetime import datetime, timedelta, date
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

print("Test")
""";
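
The imports above only confirm that the packages resolve. As a rough sketch rather than a drop-in solution, a procedure body that actually posts to Slack could look like the following; the SLACK_TOKEN and SLACK_CHANNEL environment variables and the placeholder workload are assumptions:

# Hypothetical procedure body: run a trivial Spark workload, then post a Slack message.
# SLACK_TOKEN and SLACK_CHANNEL are assumed to be made available to the job; adapt as needed.
import os

from pyspark.sql import SparkSession
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

spark = SparkSession.builder.appName("slack-notification").getOrCreate()
row_count = spark.range(100).count()  # placeholder for the real workload

client = WebClient(token=os.environ["SLACK_TOKEN"])
try:
    client.chat_postMessage(
        channel=os.environ.get("SLACK_CHANNEL", "#alerts"),
        text=f"PySpark procedure finished; processed {row_count} rows.",
    )
except SlackApiError as exc:
    print(f"Slack notification failed: {exc.response['error']}")

spark.stop()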

Additional Tips:

  • Verbose Logging: Modify the CMD in the Dockerfile to provide more verbose logging, which can help identify issues during execution.
 
CMD ["python", "-c", "print('Container is ready to run your Spark application'); import pyspark; print('PySpark version:', pyspark.__version__); from slack_sdk import WebClient; print('Slack SDK is imported successfully')"]
  • Service Account Permissions: Ensure the service account has the roles/artifactregistry.reader and roles/storage.objectViewer roles.
  • Network and Firewall Rules: Verify that there are no network or firewall rules blocking access to the Artifact Registry or other required resources.
  • Checking Cloud Logging:
    1. In the Google Cloud console, navigate to Logging > Logs Explorer.
    2. Use the following query to filter logs related to BigQuery:
resource.type="bigquery_job"
severity="ERROR"

Look for any error messages or stack traces that might give more insight into the issue.