Hi, I'm just getting up and running with Dataflow and have been hitting some hiccups along the way. I have a Cloud Function that uses `dataflow_v1beta3.FlexTemplatesServiceClient` in Python to launch a Flex Template. I have launched this template using my own credentials from the CLI with a successful run; however, when my Cloud Function successfully launches the template, the job fails with this logged:
Failed to read the result file : gs://dataflow-staging-northamerica-northeast2-31664930760/staging/template_launches/2024-02-13_13_48_32-17616911554794094215/operation_result with error message: (a91b614d0c0827f5): Unable to open template file: gs://dataflow-staging-northamerica-northeast2-31664930760/staging/template_launches/2024-02-13_13_48_32-17616911554794094215/operation_result..
Indeed, there is no file at this path, but I'm not sure what the issue would be.
Some context on how things are set up:
python apache beam pipeline code:
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    class MyPipelineOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            parser.add_argument(
                "--input-file",
                required=True,
                help="Input csv file to process.",
            )
            parser.add_argument(
                "--output-table",
                required=True,
                help="Output table to write results to.",
            )

    options = MyPipelineOptions()
    with beam.Pipeline(options=options) as p:
        ...  # pipeline transforms elided
```
template manifest.json:
```json
{
  "name": "Airbyte GCS Raw Dump to BigQuery Template",
  "description": "Takes raw data extracted to parquet format from Airbyte, transforms and loads it into BigQuery",
  "parameters": [
    {
      "name": "input-file",
      "label": "Input file",
      "helpText": "The path to the parquet file in GCS",
      "regexes": ["^gs:\\/\\/[^\\n\\r]+$"]
    },
    {
      "name": "output-table",
      "label": "Output table",
      "helpText": "The name of the table to create in BigQuery",
      "regexes": ["^[A-Za-z0-9_:.]+$"]
    }
  ]
}
```
cloud function call to trigger the job:
```python
from google.cloud import dataflow_v1beta3

client = dataflow_v1beta3.FlexTemplatesServiceClient()

parameters = {
    "input-file": f"gs://{bucket_name}/{file_name}",
    "output-table": output_table,
}

launch_parameter = dataflow_v1beta3.LaunchFlexTemplateParameter(
    job_name=job_name,
    container_spec_gcs_path=container_spec_gcs_path,  # where manifest.json is located in GCS
    parameters=parameters,
    environment={
        "service_account_email": dataflow_service_account,
        "temp_location": temp_location,
    },
)

request = dataflow_v1beta3.LaunchFlexTemplateRequest(
    project_id=project_name,
    location=region,
    launch_parameter=launch_parameter,
)

response = client.launch_flex_template(request=request)  # succeeds, but then the job fails
```
This error signifies a problem encountered by Google Cloud Dataflow when attempting to generate or access a crucial result file within Google Cloud Storage (GCS). This file is essential for detailing the execution outcomes of a Dataflow Flex Template job. The error can stem from various issues, including but not limited to:

- Insufficient GCS permissions: the service account launching or running the job cannot write to or read from the staging bucket.
- Template problems: verify that `container_spec_gcs_path` accurately points to the manifest.json file and that the template is correctly formatted and valid.
- Parameter mismatches: ensuring the parameters you pass (`input-file` and `output-table`) match the expected data types and constraints defined in your Flex Template is crucial.

Troubleshooting Steps

1. Verify GCS Permissions: confirm the relevant service accounts can read and write the staging/temp buckets (see the sketch after this list).
2. Check File Paths: confirm the `container_spec_gcs_path` at the start, ensuring correctness.
3. Validate Template File: review your manifest.json. Ensure that the parameters and their data types, along with any regex constraints, align with the data you're providing.
4. Enable Robust Diagnostics: turn up logging so any failure surfaces in Cloud Logging.
5. Check Job Resources: make sure quotas and worker resources in the region are sufficient.
6. Retry with Monitoring: relaunch while watching the job graph and worker logs in the Dataflow console.
7. Contact Google Cloud Support: if none of the above resolves it, open a case with the job ID.
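As a quick sanity check before relaunching, the two GCS preconditions above can be probed from the launcher's credentials. A minimal sketch, assuming the google-cloud-storage client library and the bucket/path names from the question (the worker service account's own access still needs to be verified separately in IAM):

```python
from google.cloud import storage


def check_launch_prerequisites(container_spec_gcs_path: str, staging_bucket: str) -> None:
    """Hypothetical helper: verify the template spec exists and staging is writable."""
    client = storage.Client()

    # 1. The template spec file must exist at container_spec_gcs_path.
    bucket_name, blob_name = container_spec_gcs_path.removeprefix("gs://").split("/", 1)
    if not client.bucket(bucket_name).blob(blob_name).exists():
        raise FileNotFoundError(f"No template spec at {container_spec_gcs_path}")

    # 2. The staging bucket must be writable -- this is where Dataflow writes the
    #    template_launches/<job-id>/operation_result file named in the error.
    probe = client.bucket(staging_bucket).blob("staging/permission_probe.txt")
    probe.upload_from_string("ok")  # raises a 403 Forbidden if write access is missing
    probe.delete()
```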
Thank you! I am pretty sure it's a permissions thing... From my cloud function, I was trying to specify the SA for dataflow with this line:
```python
environment={
    "service_account_email": dataflow_service_account,
}
```
In this case, the service account is one that I created and gave what I thought were the necessary permissions, based on this doc. I am not sure that this service account is actually being used when the template is launched, though. I don't see any logs about denied permissions, just console logs showing that the staging/temp files were not being written to the temp/staging buckets.
I was able to get the Dataflow client in my cloud function to successfully launch the Dataflow job after making two changes: 1. removing the "service_account_email" parameter from the request, and 2. giving my cloud function SA the Service Account User role on the compute default SA. However, even when I do this, the first step of my pipeline (reading a file from a GCS bucket) hangs indefinitely.
One final thing: I am able to run the script in the cloud function from my own computer using my ADC. It launches the dataflow job, and every step of the dataflow job succeeds.
Dataflow jobs default to using the Compute Engine default service account unless an alternative is specified via the `service_account_email` parameter. This default behavior is crucial for setting up the correct permissions.
Removing the `service_account_email` parameter defaults your Dataflow job to the Compute Engine default service account. This simplifies execution but necessitates ensuring this account has the appropriate permissions.
The absence of explicit "permission denied" logs requires a thorough examination of both Cloud Function and Dataflow logs. Subtle hints or indirect messages might indicate access issues, necessitating meticulous log analysis.
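While doing that analysis, the relevant entries can also be pulled programmatically. A minimal sketch, assuming the google-cloud-logging client library, using the job ID from the error message above:

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
job_id = "2024-02-13_13_48_32-17616911554794094215"  # the failed job's ID

# Pull warning-and-above entries emitted by the Dataflow job's steps.
log_filter = (
    'resource.type="dataflow_step" '
    f'resource.labels.job_id="{job_id}" '
    "severity>=WARNING"
)
for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)
```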
A Dataflow job hanging during a GCS read operation typically points to insufficient permissions for the account the workers run as, here the Compute Engine default service account. To address this, grant that account read access to the input bucket (for example, `roles/storage.objectViewer`).
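If you prefer to make that grant in code rather than the console, a minimal sketch using the google-cloud-storage IAM API, with hypothetical bucket and account names:

```python
from google.cloud import storage


def grant_bucket_read(bucket_name: str, worker_sa_email: str) -> None:
    # Give the worker service account read access to the input bucket so the
    # pipeline's initial GCS read can proceed. Names are hypothetical.
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {
            "role": "roles/storage.objectViewer",
            "members": {f"serviceAccount:{worker_sa_email}"},
        }
    )
    bucket.set_iam_policy(policy)
```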
Discrepancy Between Cloud Function and Local Execution: the difference in behavior between local execution using Application Default Credentials (ADC) and Cloud Function execution underscores the importance of consistent permissions. ADCs often have broader access, facilitating local success.
Thanks again! In terms of the step that was hanging, we examined the arguments passed to it and noticed that the delay happened when we specified `sdk_container_image` in the arguments. It's possible we are misunderstanding how to use this argument. We were able to make everything work by using the default compute engine SA and not specifying that parameter. We now get the warning that "SDK worker container image pre-building: can be enabled" when running our jobs, but at least they are succeeding at this point. We're finding it challenging to piece together all the relevant docs around packaging/running a flex template + custom docker image + pre-installing dependencies + Python. I've found docs that do some of these things, but not all together.
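For anyone retrying the custom-image route, this is roughly the shape the launch environment would take. A sketch only, using the typed `FlexTemplateRuntimeEnvironment` from `dataflow_v1beta3` with a hypothetical Artifact Registry image; note the worker service account must be able to pull the image, which could explain a hang:

```python
# Hypothetical image URL; the worker SA needs Artifact Registry read access to pull it.
environment = dataflow_v1beta3.FlexTemplateRuntimeEnvironment(
    service_account_email=dataflow_service_account,
    temp_location=temp_location,
    sdk_container_image=(
        "northamerica-northeast2-docker.pkg.dev/my-project/my-repo/beam-worker:latest"
    ),
)

launch_parameter = dataflow_v1beta3.LaunchFlexTemplateParameter(
    job_name=job_name,
    container_spec_gcs_path=container_spec_gcs_path,
    parameters=parameters,
    environment=environment,
)
```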
I would definitely prefer to use a specialized service account when running these jobs, since letting our cloud function SA use the compute default SA seems to break the "minimum permissions to do the job" rule, and we'll ultimately need to add more roles to the default compute SA, as some of our Dataflow jobs will require additional permissions, e.g. Secret Manager.
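If we do switch to a dedicated worker SA, the same Service Account User grant we made on the compute default SA would be needed on it. A minimal sketch using the discovery-based IAM client, with hypothetical project and account names:

```python
from googleapiclient import discovery

# Grant the Cloud Function's SA roles/iam.serviceAccountUser on a dedicated
# worker SA, mirroring the grant made on the compute default SA.
iam = discovery.build("iam", "v1")
resource = (
    "projects/my-project/serviceAccounts/"
    "dataflow-worker@my-project.iam.gserviceaccount.com"
)

policy = iam.projects().serviceAccounts().getIamPolicy(resource=resource).execute()
policy.setdefault("bindings", []).append(
    {
        "role": "roles/iam.serviceAccountUser",
        "members": ["serviceAccount:my-function@my-project.iam.gserviceaccount.com"],
    }
)
iam.projects().serviceAccounts().setIamPolicy(
    resource=resource, body={"policy": policy}
).execute()
```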
At any rate, we're unblocked from developing the pipelines for the time being, and we can do the improved permissions and SDK pre-build work as a follow-up. Ultimately, the original question came down to a permissions issue.
Just returning to this thread because we recently encountered a lot of job failures from quota limits that may have been related to how we weren't pre-installing external dependencies (slow startup times and high CPU usage while installing deps). I ended up addressing the warnings we were getting, and the jobs are passing again for the time being. I mentioned here that I had a hard time finding documentation around our specific use case, but I wanted to point to an example repo I found helpful, for anyone else who comes across this post while also struggling with the packaging/running a flex template + custom docker image + pre-installing dependencies scenario on a Python project:
https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipelin...