I'm trying to run a few simple CSV -> BigQuery jobs, and they keep failing with this error:
Spark program 'phase-1' failed with error: Application application_1709039989238_0002 finished with failed status. Please check the system logs for more details.
Looking at the system logs, there are various errors about connections to messaging services, such as this Wrangler log message:
DEBUG [MessagingMetricsCollectionService:i.c.c.m.c.MessagingMetricsCollectionService@175] - Failed to publish metrics to TMS due to Service 'messaging.service' is not available. Please wait until it is up and running.. Will be retried in 1000 ms.
In the runtime logs there are numerous Connection Refused errors associated with the AbstractMessagingPollingService.
Any ideas how to fix this?
The 'phase-1' error means the Spark program that processes your data failed, so the logs are needed to pinpoint the exact cause. The 'messaging.service' message says that metrics could not be published because the messaging service was not available; this might be connected to the main problem or be a separate issue. The 'Connection Refused' errors are a strong sign that your Data Fusion instance is having difficulty communicating with the other systems it needs for processing.
Here are steps to troubleshoot:
1. Check Instance Status
Navigate to the Cloud Data Fusion console to ensure your instance is in the "RUNNING" state. If it's not, start it and check the instance's health status for any errors related to internal services.
2. Firewall Rules
Verify there are no firewall rules blocking connections between your Cloud Data Fusion instance and the underlying Dataproc cluster used for processing. Ensure that your Google Cloud project's firewall settings allow for necessary communications.
3. Networking
DNS: Confirm that your Cloud Data Fusion instance can properly resolve the DNS names of internal GCP services.
Virtual Private Cloud (VPC) Peering: If your Data Fusion instance operates in a separate VPC, ensure there's proper VPC peering configured to facilitate communication.
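The DNS and connectivity checks in this step can be approximated with a small script run from a VM on the same network as the Dataproc cluster. This is only a sketch: the function names are made up for illustration, and the host and port you test are assumptions that depend on your own setup.

```python
import socket


def resolves(hostname):
    """Return True if the hostname can be resolved to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout.

    A refused or timed-out connection returns False, which is the same
    symptom as the 'Connection Refused' errors in the runtime logs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (hypothetical host and port, replace with your own):
#   print(resolves("my-dataproc-master.internal"))
#   print(can_connect("my-dataproc-master.internal", 443))
```

If a name resolves but the connection is refused or times out, that points toward firewall rules or peering rather than DNS.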
4. System Logs
Investigate the Data Fusion instance system logs and Dataproc job logs for more specific errors. These logs can provide clearer insights into the failure points. Access these logs via the Google Cloud Console or the gcloud command-line tool.
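Once the logs are exported to text, it can help to count how often each known failure signature appears, so you can see which one dominates. The sketch below assumes the logs are available as plain lines; the category names are hypothetical, and the patterns are taken only from the messages quoted in this thread.

```python
import re
from collections import Counter

# Hypothetical category names; the patterns come from the messages quoted above.
PATTERNS = {
    "messaging_unavailable": re.compile(r"Service 'messaging\.service' is not available"),
    "connection_refused": re.compile(r"connection refused", re.IGNORECASE),
    "spark_failed": re.compile(r"finished with failed status"),
}


def classify_line(line):
    """Return the names of all patterns that match a single log line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


def summarize(lines):
    """Count how many log lines match each failure signature."""
    counts = Counter()
    for line in lines:
        counts.update(classify_line(line))
    return counts


# Example usage with an exported log file (the path is an assumption):
#   with open("datafusion-system.log") as f:
#       print(summarize(f))
```

A high count of connection errors relative to Spark failures would suggest looking at networking first rather than at the pipeline itself.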
5. Resource Allocation
Review and possibly increase the resources (CPU, memory) allocated to your Data Fusion instance or the specific job, especially if your CSV to BigQuery jobs are large or complex.
6. Plugin and Dependency Checks
If you're using custom plugins or specific connectors (e.g., CSV to BigQuery), ensure they are correctly configured and compatible with your version of Data Fusion.
7. Google Cloud Support
If the issue persists after these troubleshooting steps, consider reaching out to Google Cloud Support for a more in-depth analysis of your Data Fusion instance and its networking environment.
Please Note: Error messages like these often point to fundamental networking or configuration problems within your Google Cloud environment. Addressing these issues can be complex and may require a systematic approach to troubleshooting.
Still no joy. It looks like everything is enabled and running. Looking at the errors, there are a number of 504 errors in the Cloud Dataproc Control API.
A 504 error from the Cloud Dataproc Control API means your Data Fusion instance is waiting too long for responses from Dataproc. This could be due to:
Troubleshooting Steps
Dataproc Health Check
Network Scrutiny
Timeout Settings
Data Fusion Logs
Advanced Troubleshooting
Scale Dataproc (if needed): If Dataproc seems to be constantly hitting resource limits, temporarily adding more nodes or more powerful nodes might help while you fix the underlying issue.
Contact Google Cloud Support: If the problem persists, provide Google Cloud Support with detailed logs, configuration info, and the steps you've already tried.
Additional Considerations