Howdy folks
I have an initialization script that uses pip to install a few Python modules needed by my user-defined functions in PySpark. I uploaded the script to a GCS bucket.
When I use the Google Cloud Console to create a Dataproc cluster, the script runs successfully on all of the nodes when the cluster starts up.
However, if I copy the "Equivalent Command Line" and paste it into Cloud Shell, the initialization script fails on all nodes (including the master node). The errors appear to be network-related; pip logs this warning:
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError( '<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f091cb4d110>: Failed to establish a new connection: [Errno 101] Network is unreachable')'
Here is the gcloud command line:
gcloud dataproc clusters create cluster-13a0 \
  --enable-component-gateway \
  --region us-central1 \
  --master-machine-type n4-standard-4 \
  --master-boot-disk-type hyperdisk-balanced \
  --master-boot-disk-size 100 \
  --num-workers 2 \
  --worker-machine-type n4-standard-4 \
  --worker-boot-disk-type hyperdisk-balanced \
  --worker-boot-disk-size 100 \
  --image-version 2.2-debian12 \
  --optional-components JUPYTER \
  --max-idle 7200s \
  --initialization-actions 'gs://my-example-bucket/code/initialize_cluster.sh' \
  --project MY-PROJECT-ID
I suspect that, when the cluster is created from the command line, the initialization script runs before the network and/or firewall is fully set up.
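If the timing theory is right, one thing I could try is wrapping the pip step in a retry loop. Below is a minimal sketch of that idea, not my actual script: the `retry` helper, its arguments, and the package name in the usage comment are all hypothetical. It would only help if the network becomes reachable after a delay; it would not help if the nodes never get a route to PyPI.

```shell
#!/usr/bin/env bash
# Hypothetical helper for an initialization script (illustrative names only).
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
# Re-runs the command until it succeeds or the attempt limit is reached.
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed; retrying in ${delay}s..." >&2
    sleep "${delay}"
    attempt=$(( attempt + 1 ))
  done
}

# Example call inside the script (placeholder package name):
# retry 5 30 pip install some-package
```

With 5 attempts and a 30 second delay, the script would wait at most about two minutes between the first and last attempt before giving up and returning a non-zero status.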
Any suggestions on how to work around this issue?
Thanks
R. H.