I'm trying to set up a Dataproc workflow where I extract data from an API and load it into BigQuery. To do this, I need to fetch a secret API key from Secret Manager, and herein lies the problem.
I get the following error message regardless of my attempts at installing the Secret Manager client library on the Dataproc cluster, whether via an initialization bash script or even by SSHing into the VM to pip install the library.
ImportError: cannot import name 'secretmanager' from 'google.cloud' (unknown location)
I've tried googling the error, looking at Stack Overflow, and reading GCP documentation, and yet the error above stays. Taunting me. For hours.
There was one thread that indicated that PySpark in general just can't work with Google Secret Manager (link: https://cloud.google.com/knowledge/kb/secretmanager-api-can-not-be-accessed-from-dataproc-job-000004...), but I'm hoping there's still a way here.
Here's my setup, in case anyone can give me any helpful pointers.
1. Dataproc cluster:
#!/bin/bash
pip3 install --user google-cloud-secret-manager
2. Dataproc job:
In my Dataproc pyspark job, I have my library imports, one of which is:
from google.cloud import secretmanager
I've also tried importing the package as:
import google.cloud.secretmanager as secretmanager
But sadly, no, this does not work either.
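For context, once the import works, the lookup I'm trying to do is along these lines (the project and secret names here are placeholders, not my real ones):
from google.cloud import secretmanager

# Placeholder project and secret names, not my real ones
client = secretmanager.SecretManagerServiceClient()
name = "projects/my-project/secrets/my-api-key/versions/latest"

# Fetch the latest version of the secret and decode the API key
response = client.access_secret_version(request={"name": name})
api_key = response.payload.data.decode("UTF-8")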
Update: Updated the initialization bash script above - I just realized that my copy-and-pasted code for the Python library installation wasn't fully captured.
The error message "ImportError: cannot import name 'secretmanager' from 'google.cloud' (unknown location)" indicates that the google-cloud-secret-manager library is not installed, or not properly installed, on the Dataproc cluster. Despite your attempts to install the library using the initialization bash script and SSHing into the VM, the library is not being recognized by your PySpark job.
To resolve this issue, follow these enhanced steps:
Initialization Bash Script:
Verify Installation Command: Ensure the pip3 install --user google-cloud-secret-manager command is correctly executed in the initialization bash script. Double-check for typos, especially in the library name.
Script Execution: Confirm that the initialization bash script is being invoked during cluster creation by reviewing the initialization action logs for any errors.
Script Permissions: Check the initialization bash script for the correct file permissions and ownership to ensure it can execute properly.
Python Environment:
Library Installation Path: Verify that the .local/lib/python<version>/site-packages directory is included in sys.path. If not, add it manually (see the diagnostic sketch after this list).
Dataproc and Conda: If using Conda on Dataproc, activate the appropriate Conda environment before running the pip install command.
Python Version: Confirm that the Python version used for the pip install command matches the version used by PySpark.
Additional Considerations:
Dataproc Image Version: Use a Dataproc image version that supports the required Python version for google-cloud-secret-manager.
Logging and Debugging: Implement logging within the initialization script to capture detailed output of the installation process for troubleshooting.
Dependency Conflicts: Check for any dependency conflicts that might be causing issues with importing the secretmanager module.
Virtual Environments: Utilize Python virtual environments to manage dependencies and avoid system package conflicts.
Initialization Actions: If the script is complex, consider using multiple actions or leveraging community-provided initialization actions.
Google Cloud SDK: Ensure the Google Cloud SDK on the Dataproc cluster is up to date, as it includes tools for interacting with Google Cloud services.
Isolated Environment Testing: Test the installation and import process in an environment that closely replicates the Dataproc cluster's setup.
Service Account Scopes: Check that the service account attached to the Dataproc cluster has the correct scopes enabled for accessing the Secret Manager API.
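For the library-installation-path point above, here is a quick diagnostic (just a sketch) you could run on the VM with the same interpreter PySpark uses:
import importlib.util
import site
import sys

# Where pip3 install --user puts packages for the current user
user_site = site.getusersitepackages()
print("user site-packages:", user_site)
print("on sys.path already:", user_site in sys.path)

# Add it manually if it is missing, then see whether the module can be located
if user_site not in sys.path:
    sys.path.append(user_site)
try:
    spec = importlib.util.find_spec("google.cloud.secretmanager")
except ModuleNotFoundError:
    spec = None
print("google.cloud.secretmanager importable:", spec is not None)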
Thank you for getting back to me so quickly! In response to your suggestions:
Initialization Bash Script:
Verify Installation Command: Ensure the pip3 install --user google-cloud-secret-manager command is correctly executed in the initialization bash script. Double-check for typos, especially in the library name.
It was my error in copying and pasting the initialization script - the code snippet is as you've written it. I've updated the spelling in my original post.
Script Execution: Confirm that the initialization bash script is being invoked during cluster creation by reviewing the initialization action logs for any errors.
I've gone ahead and checked the initialization action logs and saw that the installation of the library had gone through -
Successfully installed cachetools-5.3.2 google-api-core-2.12.0 google-auth-2.23.4 google-cloud-secret-manager-2.16.4 googleapis-common-protos-1.61.0 grpc-google-iam-v1-0.12.6 grpcio-1.59.2 grpcio-status-1.59.2 proto-plus-1.22.3 protobuf-4.25.0 pyasn1-0.5.0 pyasn1-modules-0.3.0 rsa-4.9
Script Permissions: Check the initialization bash script for the correct file permissions and ownership to ensure it can execute properly.
I actually wasn't quite sure how to do this step, unfortunately. My default Compute Engine service account does have permissions for Secret Manager - is this the same thing?
It's good to hear that the library installation was successful according to your logs. This suggests that the issue might not be with the installation itself but perhaps with the environment where your PySpark job is running.
Regarding script permissions, what you're looking for is whether the initialization script has the execute permission set, which it likely does if it ran and installed the packages. However, since you've confirmed that the installation was successful, this is probably not the issue.
The permissions related to your Compute Engine service account are different from the script file permissions. The service account permissions determine what resources and services your Dataproc cluster can access on Google Cloud Platform. In contrast, file permissions (like execute permissions for a bash script) are about what can be done with a file on the filesystem of the virtual machines in your cluster.
Since the library is installed, but your PySpark job cannot find it, here are a few additional steps to consider:
PySpark Environment: Ensure that the PySpark environment is using the same Python interpreter where the google-cloud-secret-manager library was installed. If PySpark is using a different interpreter, it won't have access to the library. You can set the PYSPARK_PYTHON environment variable to point to the correct interpreter.
Python Interpreter: When you SSH into the VM, you can check which Python interpreter is being used by default. Run which python and which python3 to see the paths to the Python interpreters. Then, use pip list or pip3 list to see if google-cloud-secret-manager is listed there.
Dataproc Versions: Ensure that the version of Dataproc you are using is compatible with the google-cloud-secret-manager library. Although the library is installed, there might be compatibility issues with certain versions of Dataproc.
Job Submission: When you submit your PySpark job, you can specify the Python version and packages with the --properties flag. For example, you might need to set spark.pyspark.python to the path of the Python interpreter that has the library installed.
Python Package Conflicts: It's possible that there are conflicting packages or versions that are causing issues. You might need to create a clean Python environment, install only the necessary packages, and then configure PySpark to use that environment.
Use Initialization Actions to Set Environment Variables: You can use initialization actions not only to install packages but also to set environment variables or perform other setup tasks that might be necessary for your PySpark job to run correctly.
If you've confirmed that the library is installed and accessible via the Python interpreter on the VM, but PySpark still can't find it, the issue is likely with how PySpark is configured to find Python packages. Adjusting the PYSPARK_PYTHON environment variable or using a virtual environment for your PySpark job could resolve this.
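One quick way to check which interpreter the job itself is actually running on is a small diagnostic job along these lines (only a sketch), which prints the interpreter used by the driver and by an executor:
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("driver interpreter:", sys.executable)

# Run a single trivial task so one executor reports its interpreter too
executor_interpreter = (
    spark.sparkContext
    .parallelize([0], 1)
    .map(lambda _: __import__("sys").executable)
    .collect()[0]
)
print("executor interpreter:", executor_interpreter)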
Got it - thanks for the clarification! (And thank you for continuing to respond to me!) In regards to your reply -
PySpark Environment: Ensure that the PySpark environment is using the same Python interpreter where the google-cloud-secret-manager library was installed. If PySpark is using a different interpreter, it won't have access to the library. You can set the PYSPARK_PYTHON environment variable to point to the correct interpreter.
Well, as dumb as this sounds, I'm actually not really sure where the google-cloud-secret-manager ended up getting installed. 😅 You'll see what I mean in the next response -
Python Interpreter: When you SSH into the VM, you can check which Python interpreter is being used by default. Run which python and which python3 to see the paths to the Python interpreters. Then, use pip list or pip3 list to see if the google-cloud-secret-manager is listed there.
Oddly enough, though I hadn't specified conda on my dataproc cluster, it seems like conda is automatically set in the paths:
annabel:~$ which python
/opt/conda/default/bin/python
annabel:~$ which python3
/opt/conda/default/bin/python3
When looking at the system paths in general, it seems like the paths mostly point to conda -
annabel:~$ python -m site
sys.path = [
'/home/annabel',
'/opt/conda/default/lib/python310.zip',
'/opt/conda/default/lib/python3.10',
'/opt/conda/default/lib/python3.10/lib-dynload',
'/opt/conda/default/lib/python3.10/site-packages',
'/usr/lib/spark/python',
]
USER_BASE: '/home/annabel/.local' (doesn't exist)
USER_SITE: '/home/annabel/.local/lib/python3.10/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
And when using pip list and pip3 list, I don't see google-cloud-secret-manager in any of the lists.
annabel:~$ pip list
...
google-cloud-pubsub 2.13.12
google-cloud-redis 2.9.3
google-cloud-spanner 3.19.0
google-cloud-speech 2.15.1
google-cloud-storage 2.5.0
google-cloud-texttospeech 2.12.3
google-cloud-translate 3.8.4
google-cloud-vision 3.1.4
...
annabel:~$ pip3 list
...
google-cloud-pubsub 2.13.12
google-cloud-redis 2.9.3
google-cloud-spanner 3.19.0
google-cloud-speech 2.15.1
google-cloud-storage 2.5.0
google-cloud-texttospeech 2.12.3
google-cloud-translate 3.8.4
google-cloud-vision 3.1.4
...
Dataproc Versions: Ensure that the version of Dataproc you are using is compatible with the google-cloud-secret-manager library. Although the library is installed, there might be compatibility issues with certain versions of Dataproc.
I guess, given the results above, the library doesn't seem to have ended up anywhere that pip list can see. I double-checked the initialization code, and yes - the logs do indicate that it was successfully installed:
Downloading google_cloud_secret_manager-2.16.4-py2.py3-none-any.whl (116 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.6/116.6 kB 5.9 MB/s eta 0:00:00
Installing collected packages: google-cloud-secret-manager
Successfully installed google-cloud-secret-manager-2.16.4
Use Initialization Actions to Set Environment Variables: You can use initialization actions not only to install packages but also to set environment variables or perform other setup tasks that might be necessary for your PySpark job to run correctly.
Given that I'm using a secret API key, I think I should be using Secret Manager for this, right? In case I just can't get Dataproc working with this library, though, I could also look into this as an alternative option.
Hi @annabel,
Based on the information you've provided, it appears that the google-cloud-secret-manager library is not installed in the correct environment for PySpark to access it. Since PySpark is using a Conda environment, you need to ensure that the library is installed within that specific environment.
Installing the Library in the Conda Environment:
Activate the Conda Environment: In your script, activate the environment with the source activate <environment_name> command. The conda activate command is intended for interactive use and may not work in scripts.
Install the Library: Install the google-cloud-secret-manager library using the Conda environment's pip. You may need to specify the full path to the pip executable to avoid ambiguity, like so:
/opt/conda/envs/<env_name>/bin/pip install google-cloud-secret-manager
Initialization Action Script: Set the PYSPARK_PYTHON environment variable to point to the Python executable within the Conda environment.
Dataproc Configuration: Set the spark.pyspark.python and spark.pyspark.driver.python properties to the path of the Python executable in the Conda environment.
Testing the Installation: Try importing the google-cloud-secret-manager library to ensure it's installed correctly.
Using Secret Manager for API Keys:
Alternative Options:
Security Considerations:
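Tying back to the Dataproc Configuration and Testing the Installation steps above: a minimal PySpark job like the sketch below (property names from those steps, everything else illustrative) can confirm both at once after the cluster is reconfigured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Confirm the interpreter properties the job actually sees
print("spark.pyspark.python:", spark.conf.get("spark.pyspark.python", "not set"))
print("spark.pyspark.driver.python:", spark.conf.get("spark.pyspark.driver.python", "not set"))

# Confirm the import now resolves from that interpreter
try:
    from google.cloud import secretmanager
    print("google-cloud-secret-manager import OK")
except ImportError as exc:
    print("import still failing:", exc)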
Thanks for your thorough feedback @ms4446 ! That makes a lot of sense. I really liked your first suggestion of Installing the Library in the Conda Environment, so I spent a lot more time looking into how to make this happen.
After trying various options, the solution was actually not an initialization bash script, but rather simply adding an additional property when spinning up the Dataproc cluster. (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties)
Specifically, when creating a cluster via console:
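Under Cluster properties, the addition was along the lines of:
dataproc:conda.packages=google-cloud-secret-manager==2.16.3
(the key and value shown here match the package and version in the startup log below)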
Now, conda installs the library in the correct spot when the cluster is being spun up.
startup-script[1140]: activate-component-miniconda3[3547]: 'install_conda_packages /opt/conda/miniconda3/bin google-cloud-secret-manager==2.16.3' succeeded after 1 execution(s).
(I haven't done this myself yet, but if you wanted to install the library via gcloud, you could use the following pattern, which is also in the Google documentation linked above):
gcloud beta dataproc clusters create my-cluster \
--image-version=1.5 \
--region=${REGION} \
--optional-components=ANACONDA \
--properties=^#^dataproc:conda.packages='pytorch==1.0.1,visions==0.7.1'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.5.0'
Thanks again for your time in helping me to debug this! Your input was really helpful!
If by chance my earlier solution didn't help, these steps will also work:
--initialization-actions 'gs://{YOUR BUCKET}/pip-install.sh' --metadata PIP_PACKAGES=google-cloud-secret-manager==2.16.3
This should work as well.