
Pub/Sub Publish message from Dataproc cluster using Python: ACCESS_TOKEN_SCOPE_INSUFFICIENT

Hello, 


I have a problem publishing a Pub/Sub message from a Dataproc cluster. From a Cloud Function it works well with a service account, but from Dataproc I get this error:

raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.PermissionDenied: 403 Request had insufficient authentication scopes. [reason: "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
domain: "googleapis.com"
metadata {
  key: "method"
  value: "google.pubsub.v1.Publisher.Publish"
}
metadata {
  key: "service"
  value: "pubsub.googleapis.com"
}
]

The service account assigned to this cluster is supposed to have the Pub/Sub Publisher role, but the error above still appears.

There is a workaround I have used to get past this issue, which is to publish using the service account key (.json) file, but I believe it is bad practice, as the secret (private key) is exposed and can be read from the code. I also tried to use Secret Manager, but again there is no access from the cluster: the same 403 error as when publishing to Pub/Sub.

This is how I get the cluster to publish to the Pub/Sub topic:

from google.oauth2 import service_account

service_account_credentials = {"""  hidden for security reasons lol """}

credentials = service_account.Credentials.from_service_account_info(
    service_account_credentials)

The code to publish 

import logging

from google.cloud import pubsub_v1


class EmailPublisher:

    def __init__(self, project_id: str, topic_id: str, credentials):
        self.publisher = pubsub_v1.PublisherClient(credentials=credentials)
        self.topic_path = self.publisher.topic_path(project_id, topic_id)

    def publish_message(self, message: str):
        data = str(message).encode("utf-8")
        future = self.publisher.publish(
            self.topic_path, data, origin="dataproc-python-pipeline", username="gcp")
        logging.info(future.result())
        logging.info("Published messages with custom attributes to %s", self.topic_path)

Is there any way to make the Dataproc cluster use its assigned service account and get permission to access GCP's services?

Thank you,

4 REPLIES

There are a couple of approaches you can take to enable the Dataproc cluster to access the service account and thereby gain permissions to interact with GCP's services.

One approach involves using the gcloud command-line tool to generate a service account key file and then attaching this file to the Dataproc cluster. You can create the service account key file by executing the following command:

gcloud iam service-accounts keys create <key-file-path> --iam-account=<service-account-email>

After creating the service account key file, you can attach it to the Dataproc cluster by modifying the master_config.yaml file. In this file, you need to append the following lines to the container_definitions section:

- name: my-service-account
  volumeMounts:
    - mountPath: /var/secrets/google
      name: service-account-key
      subPath: key.json

Here, my-service-account is the name of the service account you created earlier. The /var/secrets/google mount path is the location on the Dataproc cluster where the service account key file will be attached. key.json is the name of the service account key file.

After modifying the master_config.yaml file, you can create the Dataproc cluster. The cluster will then be able to read the attached service account key file and gain permissions to interact with GCP's services.
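
To consume the attached key file from Python code running on the cluster, a minimal sketch could look like this (the path below simply mirrors the mount path above and is only illustrative):

from google.oauth2 import service_account

# Illustrative path: point this at wherever the key file is attached on the cluster.
credentials = service_account.Credentials.from_service_account_file(
    "/var/secrets/google/key.json")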

Another approach is to utilize the Secret Manager to store the service account key file. In this method, you would create a secret in the Secret Manager and then grant the Dataproc cluster access to this secret.

To create a secret in the Secret Manager, execute the following command:

gcloud secrets create <secret-name> --data-file=<path-to-key-file>

After creating the secret, you need to add the following lines to the master_config.yaml file:

- name: my-secret
  secretRef:
    name: <secret-name>
    key: key.json

In this scenario, my-secret is the name of the secret you created earlier. key.json is the name of the service account key file stored in the secret.

After modifying the master_config.yaml file, you can create the Dataproc cluster. The cluster will then be able to access the service account key file from the Secret Manager and gain permissions to interact with GCP's services.
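
Once the cluster can reach Secret Manager, a rough sketch of reading the stored key from Python code on the cluster could look like this (the project ID and secret name are placeholders):

import json

from google.cloud import secretmanager
from google.oauth2 import service_account

# Placeholders: replace with your project ID and the secret name created above.
client = secretmanager.SecretManagerServiceClient()
name = "projects/<project-id>/secrets/<secret-name>/versions/latest"
response = client.access_secret_version(request={"name": name})
key_info = json.loads(response.payload.data.decode("utf-8"))
credentials = service_account.Credentials.from_service_account_info(key_info)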


I am sorry for being such a noob; I am a bit confused and I have a couple of questions:

1. Where does master_config.yaml exist?

2. How do I create a cluster with master_config.yaml using the gcloud command and the Python SDK in Cloud Functions?

3. Eventually, after creation, when writing Python code on the Dataproc cluster, how do I read that service account?

4. In the second approach, in my case the Dataproc cluster is not able to access Secret Manager due to a permissions issue.

5. With gcloud commands and the Python SDK we have the --service-account attribute, but it doesn't give permission to the cluster. Is it deprecated?

No worries. Let me clarify:

1. master_config.yaml is not a specific file in Google Cloud Dataproc. It was used as an example. In reality, you would specify the service account and scopes when you create the cluster, either through the Google Cloud Console, the gcloud command-line tool, or the Dataproc API.

2. To create a cluster with a specific service account and scopes using the gcloud command-line tool, you would use a command like this:

gcloud dataproc clusters create my-cluster --region=us-central1 --scopes=https://www.googleapis.com/auth/pubsub,https://www.googleapis.com/auth/cloud-platform --service-account=<service-account-email>

In the Python SDK, you would specify the service account and scopes when you create the cluster. Here's an example:

from google.cloud import dataproc_v1

cluster_client = dataproc_v1.ClusterControllerClient(client_options={
    'api_endpoint': '{}-dataproc.googleapis.com:443'.format('us-central1')
})

cluster_data = {
    'project_id': 'my-project',
    'cluster_name': 'my-cluster',
    'config': {
        'gce_cluster_config': {
            'service_account': '<service-account-email>',
            'service_account_scopes': [
                'https://www.googleapis.com/auth/pubsub',
                'https://www.googleapis.com/auth/cloud-platform'
            ]
        }
    }
}

operation = cluster_client.create_cluster(
    project_id='my-project', region='us-central1', cluster=cluster_data)

result = operation.result()

3. Once the cluster is created with the correct service account and scopes, your Python code running on the cluster should automatically use the service account. If you're using the Google Cloud Client Libraries, they will automatically pick up the service account associated with the environment (see the sketch after this list).

4. If your Dataproc cluster is not able to access Secret Manager, it's likely because the cluster was not created with the necessary scopes. You can add the broad cloud-platform scope (https://www.googleapis.com/auth/cloud-platform), which covers Secret Manager, when you create the cluster to give it access.

5. The --service-account flag in the gcloud command and the service_account field in the Python SDK are not deprecated. They are used to specify the service account that the cluster should use. However, just specifying the service account is not enough - you also need to ensure that the service account has the necessary IAM roles, and that the cluster is created with the necessary scopes.
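
As mentioned in point 3, once the cluster runs as the right service account with sufficient scopes, you should be able to publish without passing any key material at all. A minimal sketch, where the project and topic IDs are placeholders:

from google.cloud import pubsub_v1

# No explicit credentials: the client picks up the cluster's own service account.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("<project-id>", "<topic-id>")
future = publisher.publish(topic_path, b"hello from dataproc")
print(future.result())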

Thank you so much for your support and patience.