
Composer pipeline to upload files from Google BigQuery to an S3 bucket

I have an S3 bucket (Amazon) and I want to write a Composer pipeline to upload CSV files into that S3 bucket on a daily schedule. Currently all my data is in BigQuery; I'll convert it into CSVs and put them into the S3 bucket. Does anyone have any examples of doing this? What is the easiest way to do it? I would be grateful if someone could help. CC: @ms4446

Solved
1 ACCEPTED SOLUTION

For your scenario, where you're aiming to upload files to an Amazon S3 bucket from a Cloud Composer environment and only a single user (or service account) needs access, you don't necessarily need a third-party identity management tool like JumpCloud. AWS IAM and Google Cloud IAM, combined with workload identity federation, are sufficient.

Given your use case, here's a simplified approach without needing third-party identity providers:

  • AWS IAM Role for S3 Access: Create an IAM role in AWS with the necessary permissions to write to the specified S3 bucket. This role will be assumed by a Google Cloud service account through federation (a scripted sketch follows this list).
  • Workload Identity Federation: Set up workload identity federation between AWS and Google Cloud. This allows a Google Cloud service account to assume an AWS IAM role by obtaining temporary AWS security credentials.
  • Specifying the Provider URL: When setting up AWS IAM for federation with Google Cloud, AWS will ask for an issuer URL (the URL for the OpenID Connect provider). This URL is used by AWS to trust the identity provider (Google, in this case). For Google Cloud, the issuer URL is typically https://accounts.google.com.
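If you later prefer to script the role creation rather than click through the console steps below, a minimal boto3 sketch is shown here. It only illustrates the trust relationship described above: the role name, the attached policy ARN, and the service account unique ID are placeholders, and the audience condition assumes you mint the Google ID token with that same value as its audience.

import json
import boto3

iam_client = boto3.client('iam')

# Trust policy: allow a Google identity to assume this role via web identity
# federation, but only when the token's audience matches the expected value.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": "accounts.google.com"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Placeholder: the Google service account's unique ID
                    # (the same value the console wizard asks for).
                    "accounts.google.com:aud": "GOOGLE_SERVICE_ACCOUNT_UNIQUE_ID"
                }
            }
        }
    ]
}

iam_client.create_role(
    RoleName="YOUR_AWS_ROLE",
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Placeholder managed policy; in practice attach a policy scoped to the target bucket.
iam_client.attach_role_policy(
    RoleName="YOUR_AWS_ROLE",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess"
)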

Steps to Implement Workload Identity Federation

  1. Create an AWS IAM Role:

    • Navigate to the AWS IAM console.
    • Create a new role and select "Web identity" as the type of trusted entity.
    • For the identity provider, select "Google" and enter your Google Cloud service account’s unique ID.
    • Attach policies that grant access to the necessary S3 resources.
  2. Set Up Google Cloud Service Account:

    • Ensure you have a Google Cloud service account that will interact with AWS services.
    • Grant this service account the necessary roles in Google Cloud to perform its intended tasks.
  3. Obtain Google Service Account Credentials:

    • Use Google Cloud IAM to create and download a JSON key file for the service account. This key file is used to authenticate and mint a Google ID token, which is then exchanged for temporary AWS credentials (see the sketch after this list).
  4. Exchange Tokens for AWS Credentials:

    • Implement the token exchange using Google's authentication libraries and AWS APIs.
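
Step 3 glosses over how the Google ID token is actually minted. Below is a minimal sketch using the google-auth library, assuming you have downloaded the service account's JSON key file; the key path is a placeholder and the target audience must match whatever your AWS role's trust policy expects.

from google.oauth2 import service_account
from google.auth.transport.requests import Request

# Placeholder path to the service account's downloaded JSON key file.
credentials = service_account.IDTokenCredentials.from_service_account_file(
    '/path/to/service-account-key.json',
    # Must match the audience configured in the AWS role's trust policy.
    target_audience='GOOGLE_SERVICE_ACCOUNT_UNIQUE_ID'
)

# Refresh to obtain a Google-signed ID token for the service account.
credentials.refresh(Request())
google_id_token = credentials.token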

Example 

 
import boto3 

# Google ID token for the service account (see the sketch above for one way to obtain it)
google_id_token = 'YOUR_GOOGLE_ID_TOKEN' 

# Assume the AWS role
sts_client = boto3.client('sts')
assumed_role_object = sts_client.assume_role_with_web_identity( 
    RoleArn="arn:aws:iam::AWS_ACCOUNT_ID:role/YOUR_AWS_ROLE", 
    RoleSessionName="SessionName", 
    WebIdentityToken=google_id_token 
) 

credentials = assumed_role_object['Credentials'] 

# Now you can use these temporary credentials to access AWS services 
s3_client = boto3.client( 
    's3', 
    aws_access_key_id=credentials['AccessKeyId'], 
    aws_secret_access_key=credentials['SecretAccessKey'], 
    aws_session_token=credentials['SessionToken']
) 

# Example: List buckets
response = s3_client.list_buckets()
print(response) 
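
From there, uploading the CSVs you export from BigQuery is a single call on the same client. The bucket name and paths below are placeholders, and the file is assumed to already exist on the worker's local disk; in a Composer DAG you would typically wrap this logic in a PythonOperator task and give the DAG a daily schedule.

# Upload a CSV previously exported from BigQuery to the worker's local disk
s3_client.upload_file('/tmp/export.csv', 'your-s3-bucket', 'exports/export.csv')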

If you're finding it challenging, the AWS documentation on AssumeRoleWithWebIdentity and the Google Cloud documentation on workload identity federation are worth reviewing.

