I'm trying to set up a Dataproc workflow where I extract data from an API and load it into BigQuery, but to do this I need to get a secret API key from Secret Manager, and herein lies the problem.
I get the following error message regardless of my attempts at installing Secret Manager into the Dataproc cluster, whether via an initialization bash script or even SSHing into the VM to pip install the library.
ImportError: cannot import name 'secretmanager' from 'google.cloud' (unknown location)
I've tried googling the error, looking at Stack Overflow, and reading the GCP documentation, and yet the error above stays. Taunting me. For hours.
There was one thread that indicated that PySpark in general just can't work with Google Secret Manager (link: https://cloud.google.com/knowledge/kb/secretmanager-api-can-not-be-accessed-from-dataproc-job-000004...), but I'm hoping there's still a way here.
Here's my setup, in case anyone can give me any helpful pointers.
1. Dataproc cluster initialization script:
#!/bin/bash
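# Initialization action: try to install the Secret Manager client library on every cluster node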
pip3 install --user google-cloud-secret-manager
2. Dataproc job:
In my Dataproc PySpark job, I have my library imports, one of which is:
from google.cloud import secretmanager
I've also tried importing the package as:
import google.cloud.secretmanager as secretmanager
But sadly, no, this does not work either.
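For context, once the import works, this is roughly how I intend to read the key (the project and secret names below are just placeholders, not my real ones):
from google.cloud import secretmanager

def get_api_key(project_id, secret_id, version_id="latest"):
    # Build a Secret Manager client and read one secret version.
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# Placeholder IDs only.
api_key = get_api_key("my-project", "my-api-key-secret")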
Update: Updated the initialization bash script above - I just realized that my copy-pasted code for the Python library installation wasn't fully captured.
Hi @annabel ,
Based on the information you've provided, it appears that the google-cloud-secret-manager library is not installed in the correct environment for PySpark to access it. Since PySpark is using a Conda environment, you need to ensure that the library is installed within that specific environment.
Installing the Library in the Conda Environment:
Activate the Conda Environment: use the source activate <environment_name> command. The conda activate command is intended for interactive use and may not work in scripts.
Install the Library: install the google-cloud-secret-manager library using the Conda environment's pip. You may need to specify the full path to the pip executable to avoid ambiguity, like so:
/opt/conda/envs/<env_name>/bin/pip install google-cloud-secret-manager
Initialization Action Script: set the PYSPARK_PYTHON environment variable to point to the Python executable within the Conda environment.
Dataproc Configuration: set the spark.pyspark.python and spark.pyspark.driver.python properties to the path of the Python executable in the Conda environment.
Testing the Installation: run a simple PySpark job that imports the google-cloud-secret-manager library to ensure it's installed correctly (see the sketch after this list).
Using Secret Manager for API Keys:
Alternative Options:
Security Considerations:
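For the testing step above, a minimal smoke-test job might look something like the sketch below (untested on your cluster; it just checks that both the driver and a worker process can import the package):
from pyspark.sql import SparkSession
from google.cloud import secretmanager  # fails right away on the driver if the package is missing

spark = SparkSession.builder.appName("secretmanager-import-test").getOrCreate()

def check_import(_):
    # Re-import inside the task so the check also runs on a worker's Python interpreter.
    from google.cloud import secretmanager
    return secretmanager.__name__

# Run the import on a single executor task and print the result.
print(spark.sparkContext.parallelize([0], 1).map(check_import).collect())
spark.stop()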
Thanks for your thorough feedback, @ms4446! That makes a lot of sense. I really liked your first suggestion of installing the library in the Conda environment, so I spent a lot more time looking into how to make that happen.
After trying various options, the solution was actually not via an initialization bash script, but rather simply adding an additional property when spinning up the Dataproc cluster. (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties)
Specifically, when creating the cluster via the console, I added the cluster property dataproc:conda.packages with the value google-cloud-secret-manager==2.16.3.
Now, conda installs the library in the correct spot when the cluster is being spun up.
startup-script[1140]: activate-component-miniconda3[3547]: 'install_conda_packages /opt/conda/miniconda3/bin google-cloud-secret-manager==2.16.3' succeeded after 1 execution(s).
(I haven't done this myself yet, but if one wanted to install the library via gcloud, you could follow the pattern below, which is also in the Google documentation linked above):
gcloud beta dataproc clusters create my-cluster \
--image-version=1.5 \
--region=${REGION} \
--optional-components=ANACONDA \
--properties=^#^dataproc:conda.packages='pytorch==1.0.1,visions==0.7.1'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.5.0'
Thanks again for your time in helping me to debug this! Your input was really helpful!