Errors running Cloud Data Fusion jobs - can't connect to message queue

I'm trying to run a few simple CSV -> BigQuery jobs, and they keep failing with this error:

Spark program 'phase-1' failed with error: Application application_1709039989238_0002 finished with failed status. Please check the system logs for more details.

Looking at the system logs, there are various errors associated with connections to messaging services, such as this Wrangler log message:

DEBUG [MessagingMetricsCollectionService:i.c.c.m.c.MessagingMetricsCollectionService@175] - Failed to publish metrics to TMS due to Service 'messaging.service' is not available. Please wait until it is up and running.. Will be retried in 1000 ms.

In the runtime logs there are numerous Connection Refused errors associated with the AbstractMessagingPollingService.

Any ideas how to fix this?

 


The 'phase-1' error means the Spark program that processes your data failed; the logs are needed to pinpoint the exact cause. The 'messaging.service' message indicates the run had trouble publishing metrics to the internal messaging service (TMS); this might be connected to the main problem or a separate hiccup.

Those 'Connection Refused' errors are a strong sign that your Data Fusion instance is having difficulty communicating with the other systems it needs for processing.

Here are steps to troubleshoot:

1. Check Instance Status

  • Navigate to the Cloud Data Fusion console to ensure your instance is in the "RUNNING" state. If it's not, start it and check the instance's health status for any errors related to internal services.
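As a quick CLI alternative, the instance state can be read with gcloud (a sketch; the instance name and region below are placeholders you'd replace with your own):

```shell
#!/bin/sh
# Placeholders -- substitute your own instance name and region.
INSTANCE="my-instance"
REGION="us-central1"

# Print just the instance state; a healthy instance reports RUNNING.
if command -v gcloud >/dev/null 2>&1; then
  gcloud beta data-fusion instances describe "$INSTANCE" \
    --location="$REGION" --format="value(state)"
fi
```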

2. Firewall Rules

  • Verify there are no firewall rules blocking connections between your Cloud Data Fusion instance and the underlying Dataproc cluster used for processing. Ensure that your Google Cloud project's firewall settings allow for necessary communications.
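One way to eyeball the rules from the CLI (a sketch; the network name is a placeholder). Dataproc needs a rule allowing internal traffic between cluster nodes on the cluster's VPC network:

```shell
#!/bin/sh
# Placeholder -- the VPC network your Dataproc clusters run on.
NETWORK="default"

# List rules on that network; look for an allow rule covering
# internal traffic between the cluster nodes.
if command -v gcloud >/dev/null 2>&1; then
  gcloud compute firewall-rules list --filter="network:$NETWORK"
fi
```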

3. Networking

  • DNS: Confirm that your Cloud Data Fusion instance can properly resolve the DNS names of internal GCP services.

  • Virtual Private Cloud (VPC) Peering: If your Data Fusion instance operates in a separate VPC, ensure there's proper VPC peering configured to facilitate communication.
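Peering status can be listed per network from the CLI (a sketch; the network name is a placeholder). Each peering row shows a state, and anything other than ACTIVE means traffic will not flow across it:

```shell
#!/bin/sh
# Placeholder -- the VPC network your Data Fusion instance is attached to.
NETWORK="default"

# List all peerings on the network and their states.
if command -v gcloud >/dev/null 2>&1; then
  gcloud compute networks peerings list --network="$NETWORK"
fi
```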

4. System Logs

  • Investigate the Data Fusion instance system logs and Dataproc job logs for more specific errors. These logs can provide clearer insights into the failure points. Access these logs via the Google Cloud Console or the gcloud command-line tool.
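From the command line, recent Dataproc cluster errors can be pulled out of Cloud Logging like this (a sketch; the project ID is a placeholder, and the filter is one reasonable starting point rather than the only one):

```shell
#!/bin/sh
# Placeholder -- your Google Cloud project ID.
PROJECT="my-project"

# Error-level Dataproc cluster log entries from the last day.
FILTER='resource.type="cloud_dataproc_cluster" AND severity>=ERROR'

if command -v gcloud >/dev/null 2>&1; then
  gcloud logging read "$FILTER" \
    --project="$PROJECT" --limit=20 --freshness=1d
fi
```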

5. Resource Allocation

  • Review and possibly increase the resources (CPU, memory) allocated to your Data Fusion instance or the specific job, especially if your CSV to BigQuery jobs are large or complex.

6. Plugin and Dependency Checks

  • If you're using custom plugins or specific connectors (e.g., CSV to BigQuery), ensure they are correctly configured and compatible with your version of Data Fusion.

7. Google Cloud Support

  • If the issue persists after these troubleshooting steps, consider reaching out to Google Cloud Support for a more in-depth analysis of your Data Fusion instance and its networking environment.

Please Note: Error messages like these often point to fundamental networking or configuration problems within your Google Cloud environment. Addressing these issues can be complex and may require a systematic approach to troubleshooting.

Still no joy. It looks like everything is enabled and running. Looking at the errors, there are a number of 504 errors from the Cloud Dataproc Control API.

A 504 error from the Cloud Dataproc Control API is a gateway timeout: your Data Fusion instance waited too long for a response from Dataproc. This could be due to:

  • Dataproc Resource Constraints: If Dataproc lacks enough CPU, memory, or disk resources, it might become slow and cause timeouts.
  • Network Performance Issues: Bottlenecks or delays in the network can slow down communication between Data Fusion and Dataproc.
  • Configuration Misalignments: Incorrect settings can introduce delays in how Data Fusion and Dataproc interact.

Troubleshooting Steps

  1. Dataproc Health Check

    • Check cluster health: In the Google Cloud Console, look for any errors or warnings related to your Dataproc clusters.
    • Monitor resources: Look for signs of overloaded resources (CPU, memory, disk I/O) that could be slowing Dataproc down.
  2. Network Scrutiny

    • Check firewall rules: Make sure firewall rules allow communication between Data Fusion and Dataproc.
    • Analyze network performance: Use tools like Google Cloud's Network Intelligence Center to identify any unusual latency or bandwidth problems.
  3. Timeout Settings

    • Adjust API timeouts: If Dataproc API calls are timing out, consider increasing the timeout values in your Data Fusion configuration.
  4. Data Fusion Logs

    • Examine logs: Search Data Fusion logs around the time of the 504 errors. Pinpoint the specific Dataproc API calls that are failing.
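The health check and log search above can be sketched with gcloud (cluster, region, and project names are placeholders; the log filter is a reasonable starting point for spotting failing Dataproc API calls, not the only possible one):

```shell
#!/bin/sh
# Placeholders -- substitute your own names.
CLUSTER="my-cluster"
REGION="us-central1"
PROJECT="my-project"

if command -v gcloud >/dev/null 2>&1; then
  # 1. Cluster health: the state should be RUNNING.
  gcloud dataproc clusters describe "$CLUSTER" \
    --region="$REGION" --format="value(status.state)"

  # 2. Failing Dataproc API calls: error-level entries naming the
  #    Dataproc service in the last day.
  gcloud logging read \
    'protoPayload.serviceName="dataproc.googleapis.com" AND severity>=ERROR' \
    --project="$PROJECT" --limit=20 --freshness=1d
fi
```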

Advanced Troubleshooting

  1. Scale Dataproc (if needed): If Dataproc seems to be constantly hitting resource limits, temporarily adding more nodes or more powerful nodes might help while you fix the underlying issue.
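For a persistent cluster, the worker count can be raised from the CLI (ephemeral clusters that Data Fusion provisions per pipeline run are sized through the compute profile instead). The names and worker count below are placeholders:

```shell
#!/bin/sh
# Placeholders -- substitute your own cluster name, region, and target size.
CLUSTER="my-cluster"
REGION="us-central1"
WORKERS=4

# Resize the cluster's primary worker group.
if command -v gcloud >/dev/null 2>&1; then
  gcloud dataproc clusters update "$CLUSTER" \
    --region="$REGION" --num-workers="$WORKERS"
fi
```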

  2. Contact Google Cloud Support: If the problem persists, provide Google Cloud Support with detailed logs, configuration info, and the steps you've already tried.

Additional Considerations

  • API Rate Limits: Make sure you're not hitting Dataproc API rate limits, which could restrict how quickly you get responses.
  • External Dependencies: If your jobs rely on other services outside of Dataproc, those could also be causing delays.
  • Documentation: Always refer to the latest Google Cloud documentation for best practices and configuration recommendations.