
Configuring a custom node pool service account for dataproc

In my gcloud environment, the default service accounts have been deleted for security posture reasons. This seems to pose an issue when I attempt to create a dataproc virtual cluster on GKE; I can find how to specify the service account that the pods will adopt when running dataproc jobs, but I can't seem to figure out where to specify the service account under which the node pool itself will run.

In GKE this is possible, of course (in the console and in Terraform code), but Dataproc doesn't seem to allow use of a node pool it didn't specifically create.
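For context, this is roughly what the plain GKE case looks like with gcloud, where `--service-account` sets the identity the nodes run as. It is only a sketch: the pool, cluster, region and service account names are hypothetical placeholders, and the function prints the command instead of running it so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Print (not run) a gcloud command that creates a GKE node pool
# under a custom node service account. All values passed in below
# are placeholders, not names from a real project.
node_pool_cmd() {
  local pool="$1" cluster="$2" region="$3" sa_email="$4"
  printf 'gcloud container node-pools create %s --cluster=%s --region=%s --service-account=%s\n' \
    "$pool" "$cluster" "$region" "$sa_email"
}

# Example invocation with hypothetical names.
node_pool_cmd my-pool my-cluster us-central1 \
  my-node-sa@my-project.iam.gserviceaccount.com
```

A pool created this way belongs to GKE rather than Dataproc, which is exactly the limitation described above.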

Does anybody know how I might configure this when provisioning my Dataproc on GKE cluster?


Can any Google team member answer this question, please? Thanks so much.

Hi @matt-deboer,

Welcome to Google Cloud Community! 

The key is that you need to specify the node pool service account during the Dataproc cluster creation process. Dataproc provides configuration options to define the node pools it manages, and that includes the service account.

Make sure your custom node service account exists (e.g., my-node-sa@my-project.iam.gserviceaccount.com) and has the appropriate IAM roles, and that the Dataproc Service Agent has permission to use it.

Here are some workarounds you can try:
1. Using `gcloud dataproc clusters gke create` (CLI)

  • The `--node-pool` flag is the primary way to configure node pools when creating a Dataproc on GKE cluster using the `gcloud` command-line tool.
  • Within the `--node-pool` flag, you can use the `config.serviceAccount` property to specify the service account for the nodes in that pool. If you already have a Google Kubernetes Engine cluster, you can update the node pool by following this CLI guide.
    gcloud dataproc clusters gke create CLUSTER_NAME \
            --project=PROJECT_ID \
            --region=REGION \
            --gke-cluster=GKE_CLUSTER_NAME \
            --node-pool="name=NODE_POOL_NAME,roles=default,locations=ZONE,config.machineType=MACHINE_TYPE,config.serviceAccount=CUSTOM_NODE_SA"

Replace:

  • `CLUSTER_NAME`: The name for your Dataproc cluster.
  • `PROJECT_ID`: Your GCP project ID.
  • `REGION`: The region to deploy the cluster.
  • `GKE_CLUSTER_NAME`: The name or full resource name of your existing GKE cluster.
  • `NODE_POOL_NAME`: A name for the Dataproc-managed node pool
  • `ZONE`: The zone where you want to create the nodes (must be a valid zone for the GKE cluster).
  • `MACHINE_TYPE`: The machine type to use for the nodes.
  • `CUSTOM_NODE_SA`: The *full* email address of the custom service account you want to use for the node pool (e.g., `my-node-sa@my-project.iam.gserviceaccount.com`).

2. Using Terraform

  • Terraform provides a `google_dataproc_cluster` resource that you can use to define your Dataproc on GKE cluster.  Within the `virtual_cluster_config`, you have `kubernetes_cluster_config` and `gke_cluster_config`.
  • The node pool configuration goes in `node_pool_config`, specifically under `node_pools`.
  • The service account is specified under `config.service_account`.

Example:

terraform

    resource "google_dataproc_cluster" "default" {
      project = "PROJECT_ID"
      region  = "REGION"
      name    = "CLUSTER_NAME"

      virtual_cluster_config {
        staging_bucket = "BUCKET_NAME"
        kubernetes_cluster_config {
          gke_cluster_config {
            gke_cluster_target = "GKE_CLUSTER_NAME"
            node_pool_config {
              node_pools {
                name = "NODE_POOL_NAME"
                roles = ["DEFAULT"]
                locations = ["ZONE"]
                config {
                  machine_type   = "MACHINE_TYPE"
                  service_account = "CUSTOM_NODE_SA"
                }
              }
            }
          }
        }
      }
    }

3. Required Permissions for Service Account
The custom node service account (`CUSTOM_NODE_SA`) needs the IAM roles required to function as a GKE node. It typically needs:


If you have any questions or need further assistance with specific configurations, please reach out to our Google Cloud Support team.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Thanks @reinc; I'm attempting to apply your suggestions, but I'm finding that the Terraform example you provided doesn't work with the most recent Terraform Google provider (https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataproc_cluster). If you look at the example they've provided, you'll find that `service_account` is not a valid property on `virtual_cluster_config.gke_cluster_config.node_pool_target.node_pool_config.config`, which is why I asked the original question.

Also, it looks like the documentation you referenced for `gcloud dataproc clusters gke create` uses an example argument `--node-pool`, which is not supported by that command:
```
ERROR: (gcloud.dataproc.clusters.gke.create) unrecognized arguments: --node-pool=... (did you mean '--pools'?)
```

The closest argument available (as referenced [here](https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/gke/create)) is the `--pools` property, but once again that property does not support the `config.serviceAccount` property:
```
ERROR: (gcloud.dataproc.clusters.gke.create) argument --pools: valid keys are [accelerators, bootDiskKmsKey, localSsdCount, locations, machineType, max, min, minCpuPlatform, name, preemptible, roles]; received: config.serviceAccount
```
Based on this, I'm wondering if you're somehow using different versions of these commands which support your suggested solution? Or is there something else you can suggest to try?
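To make the mismatch concrete, here is a small sketch that checks a `--pools` spec against the keys listed in the error above before anything is sent to gcloud. The key list is copied from that error message; the function name, the pool name `dp-default` and the machine type are hypothetical, and the parsing assumes values contain no commas.

```shell
#!/usr/bin/env bash
# Keys accepted by --pools, copied from the gcloud error message above.
VALID_POOL_KEYS="accelerators bootDiskKmsKey localSsdCount locations machineType max min minCpuPlatform name preemptible roles"

# Print every key in a comma-separated key=value spec that is not in
# VALID_POOL_KEYS. Returns 0 if all keys are valid, 1 otherwise.
# Assumption: values themselves contain no commas.
invalid_pool_keys() {
  local spec="$1" pair key status=0
  local IFS=','            # split the spec on commas inside this function only
  for pair in $spec; do
    key="${pair%%=*}"      # text before the first '=' is the key
    case " $VALID_POOL_KEYS " in
      *" $key "*) ;;       # known key, nothing to report
      *) echo "$key"; status=1 ;;
    esac
  done
  return "$status"
}

# A spec that uses only listed keys (hypothetical values).
invalid_pool_keys "name=dp-default,roles=default,machineType=e2-standard-4" \
  && echo "spec ok"

# The spec from the suggested solution, which uses an unlisted key.
invalid_pool_keys "name=dp-default,roles=default,config.serviceAccount=my-node-sa@my-project.iam.gserviceaccount.com" \
  || echo "spec rejected"
```

The check only mirrors the client-side validation shown in the error; it says nothing about whether the API behind gcloud has a node service account field.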

Thank you so much for your reply. I have tried this configuration, but in my case a policy requires the default Compute Engine service account to be disabled. Would you please disable the default Compute Engine service account (for example by deleting it; you can restore it for up to 30 days after deletion) and then try the terraform apply?
