
Dataproc Pyspark job unable to import Google Secret Manager Library

I'm trying to set up a Dataproc workflow where I extract data from an API and load it into BigQuery. To do this, I need to fetch a secret API key from Secret Manager, and herein lies the problem.

I get the following error message regardless of my attempts to install the Secret Manager library on the Dataproc cluster, whether via an initialization bash script or even by SSHing into the VM to pip install the library.

ImportError: cannot import name 'secretmanager' from 'google.cloud' (unknown location)

I've tried googling the error, looking at Stack Overflow, and reading the GCP documentation, and yet the snippet above stays. Taunting me. For hours.

There was one thread that indicated that PySpark in general just can't work with Google Secret Manager (link: https://cloud.google.com/knowledge/kb/secretmanager-api-can-not-be-accessed-from-dataproc-job-000004...), but I'm hoping there's still a way here.

Here's my setup, in case anyone can give me any helpful pointers.

1. Dataproc cluster:

  • Running on Compute Engine.
    • My default Compute Engine service account has Secret Manager permissions.
  • Has an installation bash script running as an initializer (code pasted below):
#!/bin/bash
pip3 install --user google-cloud-secret-manager
  • Has metadata set to: PIP_PACKAGES==google-cloud-secret-manager

2. Dataproc job:

In my Dataproc pyspark job, I have my library imports, one of which is:

from google.cloud import secretmanager

I've also tried importing the package as:

import google.cloud.secretmanager as secretmanager

But sadly, no, this does not work either.
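For reference, this is roughly what I'm trying to do once the import works (the project and secret names below are placeholders, not my real ones):

from google.cloud import secretmanager

# Placeholder project and secret IDs
client = secretmanager.SecretManagerServiceClient()
name = "projects/my-project/secrets/my-api-key/versions/latest"
response = client.access_secret_version(request={"name": name})
api_key = response.payload.data.decode("UTF-8")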

 

Update: Updated the initialization bash script above - I just realized that my copy-and-pasted code for the Python library installation wasn't fully captured.

2 ACCEPTED SOLUTIONS

Hi @annabel ,

Based on the information you've provided, it appears that the google-cloud-secret-manager library is not installed in the correct environment for PySpark to access it. Since PySpark is using a Conda environment, you need to ensure that the library is installed within that specific environment.
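As a quick sanity check, you can see which interpreter your job actually runs on by submitting a tiny PySpark job like the sketch below (nothing here is specific to your cluster):

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("which-python").getOrCreate()

# If this path does not point into the Conda environment (e.g. somewhere under /opt/conda/),
# then pip installed the library into a different Python than the one PySpark uses.
print("driver python:", sys.executable)

spark.stop()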

Installing the Library in the Conda Environment:

  1. Activate the Conda Environment:

    • To activate the Conda environment within a bash script, use the source activate <environment_name> command. The conda activate command is intended for interactive use and may not work in scripts.
  2. Install the Library:

    • Once the environment is activated, install the google-cloud-secret-manager library using the Conda environment's pip. You may need to specify the full path to the pip executable to avoid ambiguity, like so:
       
      /opt/conda/envs/<env_name>/bin/pip install google-cloud-secret-manager

Initialization Action Script:

  • Modify your initialization action script to include the activation of the Conda environment and the installation of the library. Ensure that the script sets the PYSPARK_PYTHON environment variable to point to the Python executable within the Conda environment.

Dataproc Configuration:

  • Confirm that the Dataproc configuration specifies the correct Python interpreter from the Conda environment. This can be done by setting the spark.pyspark.python and spark.pyspark.driver.python properties to the path of the Python executable in the Conda environment.

Testing the Installation:

  • After updating your initialization script, create a new Dataproc cluster and run a test PySpark job that attempts to import the google-cloud-secret-manager library to ensure it's installed correctly.
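A minimal test job could look like the sketch below (it only verifies that the import succeeds on both the driver and the executors):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("secretmanager-import-test").getOrCreate()

# Driver-side check
from google.cloud import secretmanager
print("driver import OK:", secretmanager.__name__)

# Executor-side check: each partition attempts the same import
def try_import(_):
    from google.cloud import secretmanager  # noqa: F401
    yield "executor import OK"

print(spark.sparkContext.parallelize(range(2), 2).mapPartitions(try_import).collect())

spark.stop()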

Using Secret Manager for API Keys:

  • It is indeed recommended to use Google Secret Manager for storing and managing API keys due to its robust security features, such as encryption and access control. This is preferable to storing keys in your code, initialization scripts, or even Cloud Storage.

Alternative Options:

  • If you continue to face issues with the Secret Manager library, as a temporary workaround you could store the API key in a secure Cloud Storage bucket (a minimal sketch follows this list). However, this is less secure than using Secret Manager and should be done with caution.
  • An external secret manager like HashiCorp Vault could be considered, but ensure it's compatible with Dataproc and that you have the necessary mechanisms for authentication and secret retrieval within your PySpark job.
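For the Cloud Storage fallback, a minimal sketch could look like the following (the bucket and object path are placeholders; Dataproc's built-in GCS connector lets Spark read the object without any additional libraries, but make sure the bucket's IAM policy only allows the cluster's service account to read it):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-api-key-from-gcs").getOrCreate()

# Placeholder path - the object contains only the API key
api_key = spark.read.text("gs://my-secure-bucket/api-key.txt").first()[0].strip()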

Security Considerations:

  • Always prioritize using Secret Manager directly due to its designed purpose of handling secrets securely. If you must use alternative methods, ensure they are implemented with the highest security standards in mind.


Thanks for your thorough feedback @ms4446 ! That makes a lot of sense. I really liked your first suggestion of Installing the Library in the Conda Environment, so I spent a lot more time looking into how to make this happen.  

After trying various options, the solution was actually not an initialization bash script, but simply adding an additional property when spinning up the Dataproc cluster (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties).

Specifically, when creating a cluster via console:

  1. Navigate to: Customise cluster (optional)
  2. Click on: + ADD PROPERTIES
  3. For Prefix 1, select dataproc
  4. For Key 1, type in conda.packages
  5. For Value 1, type in google-cloud-secret-manager==2.16.3

Now, conda installs the library in the correct spot when the cluster is being spun up.

startup-script[1140]: activate-component-miniconda3[3547]: 'install_conda_packages /opt/conda/miniconda3/bin google-cloud-secret-manager==2.16.3' succeeded after 1 execution(s).

(I haven't done this myself yet, but if you wanted to install the library via gcloud, you could use the following pattern, which is also in the Google documentation link above):

gcloud beta dataproc clusters create my-cluster \
    --image-version=1.5 \
    --region=${REGION} \
    --optional-components=ANACONDA \
    --properties=^#^dataproc:conda.packages='pytorch==1.0.1,visions==0.7.1'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.5.0'

Thanks again for your time in helping me to debug this! Your input was really helpful!

 

