
GCS to BigQuery Pipeline Failed

I'm building a pipeline to load data from a GCS bucket into BigQuery with a Wrangler transformation.

(screenshot of the pipeline attached)

Here are the Data Fusion instance details:

  • Edition: Developer
  • Version: 6.10.1 (6.10.1.1)

I have already granted these roles to the service account:

  • Cloud Data Fusion API Service Agent
  • Cloud Data Fusion Runner
  • Cloud Run Service Agent
  • Dataproc Worker
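
To double-check what the service account actually holds, something like the following gcloud query can list its project-level role bindings (PROJECT_ID and SA_EMAIL are placeholders to substitute with your own values):

```sh
# Sketch: list the roles currently granted to the pipeline's service account.
# PROJECT_ID and SA_EMAIL are placeholders.
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:SA_EMAIL" \
  --format="table(bindings.role)"
```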

And this is the log from the failed pipeline:

10/31/2024 0:43:02
ERROR
Spark Program 'phase-1' failed.
10/31/2024 0:43:02
ERROR
Spark program 'phase-1' failed with error: Application application_1730309833442_0002 finished with failed status. Please check the system logs for more details.
10/31/2024 0:43:03
ERROR
Pipeline 'defect-project-dept-data-pipeline' failed.
10/31/2024 0:43:03
ERROR
Workflow service 'workflow.default.defect-project-dept-data-pipeline.DataPipelineWorkflow.52a8dc23-96e5-11ef-ab12-c66d48daef21' failed.
10/31/2024 0:43:03
ERROR
Program DataPipelineWorkflow execution failed.
10/31/2024 0:43:05
WARN
Container container_1730309833442_0001_01_000002 exited abnormally with state COMPLETE, exit code 1.
10/31/2024 0:43:10
ERROR
Dataproc job 'FAILED' with the status details: Job failed with message [java.lang.reflect.InvocationTargetException: null]

Can anyone help me with this issue?
Thanks.


Hi @instantkit,

Welcome to Google Cloud Community!

It seems you’re encountering several errors in your Data Fusion pipeline. These are mostly generic errors that can have several causes, so some further investigation is needed. Here are some suggestions that may help resolve the issue:

  • Roles and Permissions: Ensure the service account running your pipeline has the permissions needed to read from GCS and write to BigQuery. In particular, check that it has roles/bigquery.dataEditor (to create and write to BigQuery datasets and tables) and roles/storage.objectViewer (to read objects and their metadata in the Cloud Storage bucket).
  • Full logs: Review the detailed logs to pinpoint the exact cause, especially the error messages that tell you to check the system logs for more details. Failures like this can stem from configuration issues, resource exhaustion, data loading problems, data skew, or code and dependency errors.
  • Wrangler Transformation: Wrangler is a visual data preparation tool within Cloud Data Fusion. Ensure the transformation logic is correct and that its output data types and schema match the BigQuery table's schema.
  • Pipeline Configuration: Ensure your pipeline is properly configured, including the connections between data sources, transformations, and output sinks.
  • Resource Management: Ensure your pipeline driver and executors have enough memory and CPU, since jobs can fail when resources are insufficient.
  • Code Review: Review any custom code in the pipeline, since a code error will directly impact the data processing.
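
If the two roles from the first suggestion turn out to be missing, a sketch of how they could be granted with gcloud (PROJECT_ID and the service account email are placeholders to replace with your own values):

```sh
# Sketch: grant the BigQuery write and GCS read roles to the pipeline's
# service account. PROJECT_ID and SA_NAME are placeholders.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
```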

If the issue persists, I recommend reaching out to Google Cloud Support for further assistance, as they can provide insights into whether this behavior is specific to your project.

I hope the above information is helpful.

 

ymr
Bronze 1

Would it be possible to get the YARN application logs for application_1730309833442_0002?
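
Assuming the Dataproc cluster (or its aggregated logs) is still available, the YARN logs for that application could be pulled with something like:

```sh
# Sketch: fetch the aggregated YARN logs for the failed Spark application.
# Run on the Dataproc cluster's master node (e.g. over SSH); requires
# YARN log aggregation to be enabled and the logs still retained.
yarn logs -applicationId application_1730309833442_0002
```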