
Initialization script works in Dataproc Console but not from gcloud command line

Howdy folks

I have an initialization script that uses pip to install a few Python modules needed by my user-defined functions in PySpark. I uploaded the script to a GCS bucket.
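For context, the script itself is essentially a thin wrapper around pip, along these lines (a trimmed sketch; textblob is one of the actual modules, the rest are omitted here):

#!/bin/bash
# Init action sketch: install the Python modules my PySpark UDFs need.
# textblob is one of the real modules; the full list is omitted here.
set -euxo pipefail
pip install textblob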

When I use the Google Cloud Console to create a Dataproc cluster, the script runs successfully on all of the nodes when the cluster starts up.

However, if I copy the "Equivalent Command Line" and paste that into Cloud Shell, the initialization script fails on all nodes (including the manager node). The errors appear to be network-related.

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f091cb4d110>: Failed to establish a new connection: [Errno 101] Network is unreachable')'

Here is the gcloud command line:

gcloud dataproc clusters create cluster-13a0 \
    --enable-component-gateway \
    --region us-central1 \
    --master-machine-type n4-standard-4 \
    --master-boot-disk-type hyperdisk-balanced \
    --master-boot-disk-size 100 \
    --num-workers 2 \
    --worker-machine-type n4-standard-4 \
    --worker-boot-disk-type hyperdisk-balanced \
    --worker-boot-disk-size 100 \
    --image-version 2.2-debian12 \
    --optional-components JUPYTER \
    --max-idle 7200s \
    --initialization-actions 'gs://my-example-bucket/code/initialize_cluster.sh' \
    --project MY-PROJECT-ID

I suspect that somehow, when using the command-line, the initialization script is running before the network and/or firewall is set up properly.

Any suggestions on how to work around this issue?

Thanks

R. H.


Hi @rholowczak,

Welcome to Google Cloud Community!

Have you tried the Troubleshoot cluster creation issues guide? It provides steps for diagnosing networking issues.

If this does not work, I recommend submitting an issue report so that our Engineering Team can look into it. Before filing, please review what to expect when opening an issue. For further clarification, you may reach out to Google Cloud Support for a one-on-one discussion.

Note: Provide detailed information and relevant screenshots to make it easier for them to solve your issue.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Thanks for the response. I ran the gcpdiag runbook and it basically told me the same thing as the error logs (see output below).

The initialization script failed. I have been reading through the suggested tips on initialization scripts: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions#important_considerations_and_guidelines

However, I cannot seem to find which tip is the most relevant. I am not pulling any code from GitHub; I am just running pip install .....

There is a note about trying to set the dataproc.master.custom.init.actions.mode cluster property to RUN_AFTER_SERVICES. However, this only applies to the master (manager) node, so it would not help the worker (core) nodes.
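For reference, here is how I read the docs on setting that property from the command line (untested on my side):

gcloud dataproc clusters create cluster-13a0 \
    --region us-central1 \
    --properties 'dataproc:dataproc.master.custom.init.actions.mode=RUN_AFTER_SERVICES' \
    --initialization-actions 'gs://my-example-bucket/code/initialize_cluster.sh'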

Again, it does not make sense that this works fine from the Console but fails with the exact command line that the Console itself suggests.
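One way I can think of to pin down the difference is to create one cluster from the Console and one from the CLI, then dump and diff what the API actually recorded for each:

gcloud dataproc clusters describe cluster-13a0 --region us-central1 --format yaml

and compare the gceClusterConfig sections (the internalIpOnly field in particular).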

WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f7af466fa50>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/textblob/
ERROR: Could not find a version that satisfies the requirement textblob (from versions: none)
ERROR: No matching distribution found for textblob

Unfortunately, I am on an Educational Account and I am not able to open a Support Ticket.

Thanks again for the suggestions.

===================================================

professorholowczak@cloudshell:~ (my-gcp-project-123456)$ gcpdiag runbook dataproc/cluster-creation --parameter project_id=my-gcp-project-123456 --parameter cluster_name=cluster-abcde --parameter internal_ip_only=false --parameter region=us-central1
ERROR: can't import module: No module named 'dns'
ERROR: can't import module: No module named 'dns'
ERROR: can't import module: No module named 'dns'
ERROR: can't import module: No module named 'dns'
Starting runbook inspection [Alpha Release]

gcpdiag 0.78

dataproc/cluster-creation: Provides a comprehensive analysis of common issues which affect Dataproc cluster creation.

This runbook focuses on a range of potential problems for Dataproc clusters on
Google Cloud Platform. By conducting a series of checks, the runbook aims to
pinpoint the root cause of cluster creation difficulties.

The following areas are examined:

- Stockout errors: Evaluates Logs Explorer logs regarding stockout in the
region/zone.

- Quota availability: Checks for the quota availability in Dataproc cluster project.

- Network configuration: Performs GCE Network Connectivity Tests, checks necessary firewall rules, external/internal IP configuration.

- Cross-project configuration: Checks if the service account is not in the same
project and reviews additional
roles and organization policies enforcement.

- Shared VPC configuration: Checks if the Dataproc cluster uses a Shared VPC network and
evaluates if right service account roles are added.

- Init actions script failures: Evaluates Logs Explorer
logs regarding init actions script failures or timeouts.

[START]: Verify cluster quota.

- my-gcp-project-123456 [OK]
[REASON]
No issues with insufficient quota in project my-gcp-project-123456 has been identified for the ivestigated cluster cluster-abcde, please double-check if you have provided
the right cluster_name parameter if the cluster you are trying to create doesn't appear in Dataproc UI.

[AUTOMATED STEP]: Verify cluster stockout issue.

- my-gcp-project-123456 [OK]
[REASON]
No issues with stockouts and insufficient resources in project my-gcp-project-123456 has been identified for cluster-abcde, please double-check if you have provided
the right cluster_name parameter if the cluster you are trying to create doesn't appear in Dataproc UI.

[AUTOMATED STEP]: Verify cluster exists in Dataproc UI.

- my-gcp-project-123456/us-central1/cluster-abcde [OK]
[REASON]
Cluster cluster-abcde exists in project projects/my-gcp-project-123456

[GATEWAY]: Verify cluster is in ERROR state.
[INFO]: Cluster is in ERROR state or not existing and additional parameters has been provided
[AUTOMATED STEP]: Gathering cluster details.

- my-gcp-project-123456/us-central1/cluster-abcde [OK]
[REASON]
Stackdriver: Enabled

[INFO]: Service Account:123456789123-compute@developer.gserviceaccount.com
[INFO]: Network: https://www.googleapis.com/compute/v1/projects/my-gcp-project-123456/global/networks/default
[AUTOMATED STEP]: Verify network connectivity among nodes in the cluster.
[INFO]: Zone: us-central1-c
[INFO]: Running connectivity tests.
[INFO]: ICMP test.
[INFO]: Connectivity test result: REACHABLE
[INFO]: TCP test.
[INFO]: Connectivity test result: REACHABLE
[INFO]: UDP test.
[INFO]: Connectivity test result: REACHABLE

- my-gcp-project-123456/us-central1/cluster-abcde [OK]
[REASON]
The network communication among nodes in cluster cluster-abcde is working.

[GATEWAY]: Checking if the cluster is using internal IP only.
[INFO]: Internal IP only: True

- my-gcp-project-123456/us-central1/cluster-abcde [OK]
[REASON]
Subnetwork: https://www.googleapis.com/compute/v1/projects/my-gcp-project-123456/regions/us-central1/subnetworks/default
[AUTOMATED STEP]: Checking if the subnetwork of the cluster has private google access enabled.

- my-gcp-project-123456/us-central1/cluster-abcde [OK]
[REASON]
Google Private Access in subnet: https://www.googleapis.com/compute/v1/projects/my-gcp-project-123456/regions/us-central1/subnetworks/default is enabled.

[GATEWAY]: Checking service account project.
[INFO]: 123456789123-compute@developer.gserviceaccount.com
[INFO]: VM Service Account associated with Dataproc cluster was found in the same project
[INFO]: Checking permissions.
[AUTOMATED STEP]: Verify that serviceAccount:123456789123-compute@developer.gserviceaccount.com has required permissions/roles in project/my-gcp-project-123456.
[WARNING] Using "-" wildcard to infer host project for service account: service-123456789123@trifacta-gcloud-prod.iam.gserviceaccount.com. Rules which rely on method: projects.serviceAccounts.get to determine disabled vrs deleted status of service-123456789123@trifacta-gcloud-prod.iam.gserviceaccount.com may produce misleading results. See: https://cloud.google.com/iam/docs/reference/rest/v1/projects.serviceAccounts/get
[WARNING] can't retrieve service account service-123456789123@trifacta-gcloud-prod.iam.gserviceaccount.com belonging to project - but used in project: my-gcp-project-123456

- projects/my-gcp-project-123456 [OK]
[REASON]
serviceAccount:123456789123-compute@developer.gserviceaccount.com has expected roles.
roles/dataproc.worker.

[AUTOMATED STEP]: Verify service account roles based on Shared VPC.

- my-gcp-project-123456 [SKIP]
[REASON]
Cluster is not using a Shared VPC network
[AUTOMATED STEP]: Verify Cluster init script failure.

- my-gcp-project-123456 [FAIL]
[REASON]
The cluster cluster-abcde creation failed because the initialization script encountered an error.

[REMEDIATION]
A Dataproc cluster init script failure means that a script intended to run during the cluster's initial setup did not complete successfully.
Solution:
See initialization actions considerations and guidelines [1].
Examine the output logs. The error message should provide a link to the logs in Cloud Storage.
[1]<https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions#important_considerations_and_guidelines>

[Choose the next action]:

[r] - Retest current step
[c] - Continue
[s] - Stop Runbook

Choose an option: C
[END]: This is the end step of the runbook.
[INFO]: Please visit all the FAIL steps and address the suggested remediations.
If the cluster is still not able to be provisioned successfully,
run the runbook again and open a Support case. If you are missing
Service Account permissions, but are not able to see the Service Agent
Service Account go to the IAM page and check 'Include Google-provided
role grants'

Runbook report located in: /tmp/gcpdiag_runbook_report_dataproc_cluster-creation_5c49_6af9_2025_04_28_14_43_07_UTC.json
Rules summary: 1 skipped, 8 ok, 1 failed, 0 uncertain


Hi again

I believe I have narrowed the problem down to a networking issue.

Through the GCP Console, when you uncheck the option "Configure all instances to have only internal IP addresses", all of the nodes get external IP addresses and can connect to hosts outside the VPC. Note that the default for this option changed in Fall 2024: it is now checked by default.

However, that setting is not reflected in the equivalent gcloud command line. I am at a loss as to how to get the cluster to communicate with the outside world when it is created from the command line.

The Dataproc diagnostic shows:

[GATEWAY]: Checking if the cluster is using internal IP only.
[INFO]: Internal IP only: True

As far as I can tell, there is no opposite of the --internal-ip-only setting.
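Most gcloud boolean flags accept a --no- prefix, so in theory the negation might be spelled as below, but I have not been able to confirm that Dataproc honors it:

gcloud dataproc clusters create cluster-13a0 \
    --region us-central1 \
    --no-internal-ip-only \
    --initialization-actions 'gs://my-example-bucket/code/initialize_cluster.sh'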

I have a VPC network called 'default' that I have used for VMs for over a year.

In my gcloud command line I have tried both --subnet default and --network default.

Neither of these permits the master or core nodes to communicate with hosts outside of the VPC.

The only workaround I have found is to stage the Python modules' wheel files in my Cloud Storage bucket. Then, during cluster initialization, copy those wheel files to the /tmp folder and run the install locally.
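In case it helps anyone else, the init action for that workaround looks roughly like this (the bucket path and module name are examples from my setup; the wheels were staged ahead of time with pip download):

#!/bin/bash
# Offline install: copy pre-staged wheels from GCS, then install from the
# local folder only (--no-index avoids any call out to pypi.org).
set -euxo pipefail
mkdir -p /tmp/wheels
gsutil -m cp 'gs://my-example-bucket/wheels/*.whl' /tmp/wheels/
pip install --no-index --find-links /tmp/wheels textblob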