I'm trying to set up a Dataproc workflow where I extract data from an API and load it into BigQuery. To do this, I need to fetch a secret API key from Secret Manager, and herein lies the problem.
I get the following error message regardless of my attempts at installing the Secret Manager client library on the Dataproc cluster, whether via an initialization bash script or even by SSHing into the VM to pip install the library.
ImportError: cannot import name 'secretmanager' from 'google.cloud' (unknown location)
I've tried googling the error, looking at Stack Overflow, and reading GCP documentation, and yet the error above stays. Taunting me. For hours.
There was one thread that indicated that PySpark in general just can't work with Google Secret Manager (link: https://cloud.google.com/knowledge/kb/secretmanager-api-can-not-be-accessed-from-dataproc-job-000004...), but I'm hoping there's still a way here.
Here's my setup, in case anyone can give me any helpful pointers.
1. Dataproc cluster:
#!/bin/bash
pip3 install --user google-cloud-secret-manager
2. Dataproc job:
In my Dataproc pyspark job, I have my library imports, one of which is:
from google.cloud import secretmanager
I've also tried importing the package as:
import google.cloud.secretmanager as secretmanager
But sadly, no, this does not work either.
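For context, once the import works, the lookup I'm trying to do is along these lines (the project and secret names here are placeholders, not my real ones):
from google.cloud import secretmanager

# Placeholder project and secret names, not my real ones
client = secretmanager.SecretManagerServiceClient()
name = "projects/my-project/secrets/my-api-key/versions/latest"

# Fetch the latest version of the secret and decode the API key
response = client.access_secret_version(request={"name": name})
api_key = response.payload.data.decode("UTF-8")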
Update: Updated the initialization bash script above - I just realized that my copy-and-pasted code for the Python library installation wasn't fully captured.
The error message "ImportError: cannot import name 'secretmanager' from 'google.cloud' (unknown location)" indicates that the google-cloud-secret-manager library is not installed, or not properly installed, on the Dataproc cluster. Despite your attempts to install the library using the initialization bash script and SSHing into the VM, the library is not being recognized by your PySpark job.
To resolve this issue, follow these enhanced steps:
Initialization Bash Script:
Verify Installation Command: Ensure the pip3 install --user google-cloud-secret-manager command is correctly executed in the initialization bash script. Double-check for typos, especially in the library name.
Script Execution: Confirm that the initialization bash script is being invoked during cluster creation by reviewing the initialization action logs for any errors.
Script Permissions: Check the initialization bash script for the correct file permissions and ownership to ensure it can execute properly.
Python Environment:
Library Installation Path: Verify that the .local/lib/python<version>/site-packages directory is included in sys.path. If not, add it manually (see the diagnostic sketch after this list).
Dataproc and Conda: If using Conda on Dataproc, activate the appropriate Conda environment before running the pip install command.
Python Version: Confirm that the Python version used for the pip install command matches the version used by PySpark.
Additional Considerations:
Dataproc Image Version: Use a Dataproc image version that supports the required Python version for google-cloud-secret-manager.
Logging and Debugging: Implement logging within the initialization script to capture detailed output of the installation process for troubleshooting.
Dependency Conflicts: Check for any dependency conflicts that might be causing issues with importing the secretmanager module.
Virtual Environments: Utilize Python virtual environments to manage dependencies and avoid system package conflicts.
Initialization Actions: If the script is complex, consider using multiple actions or leveraging community-provided initialization actions.
Google Cloud SDK: Ensure the Google Cloud SDK on the Dataproc cluster is up to date, as it includes tools for interacting with Google Cloud services.
Isolated Environment Testing: Test the installation and import process in an environment that closely replicates the Dataproc cluster's setup.
Service Account Scopes: Check that the service account attached to the Dataproc cluster has the correct scopes enabled for accessing the Secret Manager API.
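For the library-installation-path point above, here is a quick diagnostic (just a sketch) you could run on the VM with the same interpreter PySpark uses:
import importlib.util
import site
import sys

# Where pip3 install --user puts packages for the current user
user_site = site.getusersitepackages()
print("user site-packages:", user_site)
print("on sys.path already:", user_site in sys.path)

# Add it manually if it is missing, then see whether the module can be located
if user_site not in sys.path:
    sys.path.append(user_site)
try:
    spec = importlib.util.find_spec("google.cloud.secretmanager")
except ModuleNotFoundError:
    spec = None
print("google.cloud.secretmanager importable:", spec is not None)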
Thank you for getting back to me so quickly! In response to your suggestions:
Initialization Bash Script:
Verify Installation Command: Ensure the pip3 install --user google-cloud-secret-manager command is correctly executed in the initialization bash script. Double-check for typos, especially in the library name.
It was my error in copying and pasting the initialization script - the code snippet is as you've written it. I've updated the spelling in my original post.
Script Execution: Confirm that the initialization bash script is being invoked during cluster creation by reviewing the initialization action logs for any errors.
I've gone ahead and checked the initialization action logs and saw that the installation of the library had gone through -
Successfully installed cachetools-5.3.2 google-api-core-2.12.0 google-auth-2.23.4 google-cloud-secret-manager-2.16.4 googleapis-common-protos-1.61.0 grpc-google-iam-v1-0.12.6 grpcio-1.59.2 grpcio-status-1.59.2 proto-plus-1.22.3 protobuf-4.25.0 pyasn1-0.5.0 pyasn1-modules-0.3.0 rsa-4.9
Script Permissions: Check the initialization bash script for the correct file permissions and ownership to ensure it can execute properly.
I actually wasn't quite sure how to do this step, unfortunately. My default Compute Engine service account does have permissions for Secret Manager - is this the same thing?
It's good to hear that the library installation was successful according to your logs. This suggests that the issue might not be with the installation itself but perhaps with the environment where your PySpark job is running.
Regarding script permissions, what you're looking for is whether the initialization script has the execute permission set, which it likely does if it ran and installed the packages. However, since you've confirmed that the installation was successful, this is probably not the issue.
The permissions related to your Compute Engine service account are different from the script file permissions. The service account permissions determine what resources and services your Dataproc cluster can access on Google Cloud Platform. In contrast, file permissions (like execute permissions for a bash script) are about what can be done with a file on the filesystem of the virtual machines in your cluster.
Since the library is installed, but your PySpark job cannot find it, here are a few additional steps to consider:
PySpark Environment: Ensure that the PySpark environment is using the same Python interpreter where the google-cloud-secret-manager library was installed. If PySpark is using a different interpreter, it won't have access to the library. You can set the PYSPARK_PYTHON environment variable to point to the correct interpreter.
Python Interpreter: When you SSH into the VM, you can check which Python interpreter is being used by default. Run which python and which python3 to see the paths to the Python interpreters. Then, use pip list or pip3 list to see if google-cloud-secret-manager is listed there.
Dataproc Versions: Ensure that the version of Dataproc you are using is compatible with the google-cloud-secret-manager library. Although the library is installed, there might be compatibility issues with certain versions of Dataproc.
Job Submission: When you submit your PySpark job, you can specify the Python version and packages with the --properties flag. For example, you might need to set spark.pyspark.python to the path of the Python interpreter that has the library installed.
Python Package Conflicts: It's possible that there are conflicting packages or versions that are causing issues. You might need to create a clean Python environment, install only the necessary packages, and then configure PySpark to use that environment.
Use Initialization Actions to Set Environment Variables: You can use initialization actions not only to install packages but also to set environment variables or perform other setup tasks that might be necessary for your PySpark job to run correctly.
If you've confirmed that the library is installed and accessible via the Python interpreter on the VM, but PySpark still can't find it, the issue is likely with how PySpark is configured to find Python packages. Adjusting the PYSPARK_PYTHON environment variable or using a virtual environment for your PySpark job could resolve this.
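One quick way to check which interpreter the job itself is actually running on is a small diagnostic job along these lines (only a sketch), which prints the interpreter used by the driver and by an executor:
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("driver interpreter:", sys.executable)

# Run a single trivial task so one executor reports its interpreter too
executor_interpreter = (
    spark.sparkContext
    .parallelize([0], 1)
    .map(lambda _: __import__("sys").executable)
    .collect()[0]
)
print("executor interpreter:", executor_interpreter)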
Got it - thanks for the clarification! (And thank you for continuing to respond to me!) In regards to your reply -
PySpark Environment: Ensure that the PySpark environment is using the same Python interpreter where the google-cloud-secret-manager library was installed. If PySpark is using a different interpreter, it won't have access to the library. You can set the PYSPARK_PYTHON environment variable to point to the correct interpreter.
Well, as dumb as this sounds, I'm actually not really sure where the google-cloud-secret-manager ended up getting installed. 😅 You'll see what I mean in the next response -
Python Interpreter: When you SSH into the VM, you can check which Python interpreter is being used by default. Run which python and which python3 to see the paths to the Python interpreters. Then, use pip list or pip3 list to see if the google-cloud-secret-manager is listed there.
Oddly enough, though I hadn't specified conda on my dataproc cluster, it seems like conda is automatically set in the paths:
annabel:~$ which python
/opt/conda/default/bin/python
annabel:~$ which python3
/opt/conda/default/bin/python3
When looking at the system paths in general, it seems like the paths mostly point to conda -
annabel:~$ python -m site
sys.path = [
'/home/annabel',
'/opt/conda/default/lib/python310.zip',
'/opt/conda/default/lib/python3.10',
'/opt/conda/default/lib/python3.10/lib-dynload',
'/opt/conda/default/lib/python3.10/site-packages',
'/usr/lib/spark/python',
]
USER_BASE: '/home/annabel/.local' (doesn't exist)
USER_SITE: '/home/annabel/.local/lib/python3.10/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
And when using pip list and pip3 list, I don't see google-cloud-secret-manager in any of the lists.
annabel:~$ pip list
...
google-cloud-pubsub 2.13.12
google-cloud-redis 2.9.3
google-cloud-spanner 3.19.0
google-cloud-speech 2.15.1
google-cloud-storage 2.5.0
google-cloud-texttospeech 2.12.3
google-cloud-translate 3.8.4
google-cloud-vision 3.1.4
...
annabel:~$ pip3 list
...
google-cloud-pubsub 2.13.12
google-cloud-redis 2.9.3
google-cloud-spanner 3.19.0
google-cloud-speech 2.15.1
google-cloud-storage 2.5.0
google-cloud-texttospeech 2.12.3
google-cloud-translate 3.8.4
google-cloud-vision 3.1.4
...
Dataproc Versions: Ensure that the version of Dataproc you are using is compatible with the google-cloud-secret-manager library. Although the library is installed, there might be compatibility issues with certain versions of Dataproc.
I guess, given the results above, the library doesn't seem to have ended up anywhere that pip list can see. I double-checked the initialization code, and yes - the logs do indicate that it was successfully installed:
Downloading google_cloud_secret_manager-2.16.4-py2.py3-none-any.whl (116 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.6/116.6 kB 5.9 MB/s eta 0:00:00
Installing collected packages: google-cloud-secret-manager
Successfully installed google-cloud-secret-manager-2.16.4
Use Initialization Actions to Set Environment Variables: You can use initialization actions not only to install packages but also to set environment variables or perform other setup tasks that might be necessary for your PySpark job to run correctly.
Given that I'm using a secret API key, I think I should be using Secret Manager for this, right? In case I just can't get Dataproc working with this library, though, I could also look into this as an alternative option.
Hi @annabel,
Based on the information you've provided, it appears that the google-cloud-secret-manager library is not installed in the correct environment for PySpark to access it. Since PySpark is using a Conda environment, you need to ensure that the library is installed within that specific environment.
Installing the Library in the Conda Environment:
Activate the Conda Environment: In your script, activate the environment with the source activate <environment_name> command. The conda activate command is intended for interactive use and may not work in scripts.
Install the Library: Install the google-cloud-secret-manager library using the Conda environment's pip. You may need to specify the full path to the pip executable to avoid ambiguity, like so:
/opt/conda/envs/<env_name>/bin/pip install google-cloud-secret-manager
Initialization Action Script: Set the PYSPARK_PYTHON environment variable to point to the Python executable within the Conda environment.
Dataproc Configuration: Set the spark.pyspark.python and spark.pyspark.driver.python properties to the path of the Python executable in the Conda environment.
Testing the Installation: Try importing the google-cloud-secret-manager library to ensure it's installed correctly.
Using Secret Manager for API Keys:
Alternative Options:
Security Considerations:
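Tying back to the Dataproc Configuration and Testing the Installation steps above: a minimal PySpark job like the sketch below (property names from those steps, everything else illustrative) can confirm both at once after the cluster is reconfigured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Confirm the interpreter properties the job actually sees
print("spark.pyspark.python:", spark.conf.get("spark.pyspark.python", "not set"))
print("spark.pyspark.driver.python:", spark.conf.get("spark.pyspark.driver.python", "not set"))

# Confirm the import now resolves from that interpreter
try:
    from google.cloud import secretmanager
    print("google-cloud-secret-manager import OK")
except ImportError as exc:
    print("import still failing:", exc)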
Thanks for your thorough feedback @ms4446 ! That makes a lot of sense. I really liked your first suggestion of Installing the Library in the Conda Environment, so I spent a lot more time looking into how to make this happen.
After trying various options, the solution was actually not an initialization bash script, but rather simply adding an additional property when spinning up the Dataproc cluster. (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties)
Specifically, when creating a cluster via console:
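Under Cluster properties, the addition was along the lines of:
dataproc:conda.packages=google-cloud-secret-manager==2.16.3
(the key and value shown here match the package and version in the startup log below)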
Now, conda installs the library in the correct spot when the cluster is being spun up.
startup-script[1140]: activate-component-miniconda3[3547]: 'install_conda_packages /opt/conda/miniconda3/bin google-cloud-secret-manager==2.16.3' succeeded after 1 execution(s).
(I haven't done this myself yet, but if you wanted to install the library via gcloud, you could use the following pattern, which is also in the Google documentation linked above):
gcloud beta dataproc clusters create my-cluster \
--image-version=1.5 \
--region=${REGION} \
--optional-components=ANACONDA \
--properties=^#^dataproc:conda.packages='pytorch==1.0.1,visions==0.7.1'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.5.0'
Thanks again for your time in helping me to debug this! Your input was really helpful!
If by chance my earlier solution didn't help, these steps will also work:
--initialization-actions 'gs://{YOUR BUCKET}/pip-install.sh' --metadata PIP_PACKAGES=google-cloud-secret-manager==2.16.3
This should work as well.