
GCS to BigQuery Pipeline Failed

I'm building a pipeline to load data from a GCS bucket into BigQuery with a Wrangler transformation.

(screenshot of the pipeline attached)

Here are the Data Fusion instance details:

  • Edition: Developer
  • Version: 6.10.1 (6.10.1.1)

I have already granted these roles to the service account:

  • Cloud Data Fusion API Service Agent
  • Cloud Data Fusion Runner
  • Cloud Run Service Agent
  • Dataproc Worker
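
To double-check what the service account actually holds, something like the following gcloud query can list its project-level role bindings (PROJECT_ID and SA_EMAIL are placeholders to substitute with your own values):

```sh
# Sketch: list the roles currently granted to the pipeline's service account.
# PROJECT_ID and SA_EMAIL are placeholders.
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:SA_EMAIL" \
  --format="table(bindings.role)"
```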

And this is the log from the failed pipeline:

10/31/2024 0:43:02
ERROR
Spark Program 'phase-1' failed.
10/31/2024 0:43:02
ERROR
Spark program 'phase-1' failed with error: Application application_1730309833442_0002 finished with failed status. Please check the system logs for more details.
10/31/2024 0:43:03
ERROR
Pipeline 'defect-project-dept-data-pipeline' failed.
10/31/2024 0:43:03
ERROR
Workflow service 'workflow.default.defect-project-dept-data-pipeline.DataPipelineWorkflow.52a8dc23-96e5-11ef-ab12-c66d48daef21' failed.
10/31/2024 0:43:03
ERROR
Program DataPipelineWorkflow execution failed.
10/31/2024 0:43:05
WARN
Container container_1730309833442_0001_01_000002 exited abnormally with state COMPLETE, exit code 1.
10/31/2024 0:43:10
ERROR
Dataproc job 'FAILED' with the status details: Job failed with message [java.lang.reflect.InvocationTargetException: null]

Can anyone help me with this issue?
Thanks.


Hi @instantkit,

Welcome to Google Cloud Community!

It seems you’re encountering several errors in your Data Fusion pipeline. These are mostly generic errors that can have several causes, so some further investigation is needed. Here are some suggestions that may help resolve the issue:

  • Roles and Permissions: Ensure the service account running your pipeline has the permissions needed to read from GCS and write to BigQuery. In particular, check that it has roles/bigquery.dataEditor (to create and write to BigQuery datasets and tables) and roles/storage.objectViewer (to read objects and their metadata in the Cloud Storage bucket).
  • Full logs: Review the detailed logs to pinpoint the exact cause, especially the error messages that tell you to check the system logs for more details. Failures like this can stem from configuration issues, resource exhaustion, data loading problems, data skew, or code and dependency errors.
  • Wrangler Transformation: Wrangler is a visual data preparation tool within Cloud Data Fusion. Ensure the transformation logic is correct and that its output data types and schema match the BigQuery table's schema.
  • Pipeline Configuration: Ensure your pipeline is properly configured, including the connections between data sources, transformations, and output sinks.
  • Resource Management: Ensure your pipeline driver and executors have enough memory and CPU, since jobs can fail when resources are insufficient.
  • Code Review: Review any custom code in the pipeline, since a code error will directly impact the data processing.
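
If the two roles from the first suggestion turn out to be missing, a sketch of how they could be granted with gcloud (PROJECT_ID and the service account email are placeholders to replace with your own values):

```sh
# Sketch: grant the BigQuery write and GCS read roles to the pipeline's
# service account. PROJECT_ID and SA_NAME are placeholders.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
```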

If the issue persists, I recommend reaching out to Google Cloud Support for further assistance, as they can provide insights into whether this behavior is specific to your project.

I hope the above information is helpful.

 

ymr
Bronze 1

Would it be possible to get the YARN application logs for application_1730309833442_0002?
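
Assuming the Dataproc cluster (or its aggregated logs) is still available, the YARN logs for that application could be pulled with something like:

```sh
# Sketch: fetch the aggregated YARN logs for the failed Spark application.
# Run on the Dataproc cluster's master node (e.g. over SSH); requires
# YARN log aggregation to be enabled and the logs still retained.
yarn logs -applicationId application_1730309833442_0002
```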