
Azure Blob source in Data Fusion pipeline

Hi,

I am creating a pipeline in Data Fusion with Azure Blob as the source and a GCS bucket as the target. When I run the pipeline, the flow gets stuck with the warning below and then fails with a timeout error. Please let me know if anyone has faced a similar issue, or guide me on how to resolve it.

 
WARN
Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.HftpFileSystem not found


1 ACCEPTED SOLUTION

While Google Cloud Data Fusion offers robust capabilities for data integration and transformation, it's important to accurately assess its suitability for your specific use case of transferring files between Azure Blob Storage and GCS at 15-minute intervals. Here's a refined perspective on using Data Fusion for this task and some insights into potential challenges:

Suitability of Datafusion

  • Scheduled Transfers: Data Fusion supports scheduling, which can facilitate regular data transfer jobs. However, for very frequent, lightweight file transfer tasks, alternative tools might be more efficient.

  • Data Processing vs. File Transfer: Data Fusion excels in scenarios where data needs to be processed or transformed during transfer. If your requirement is to move files as-is, without processing, the overhead of Data Fusion pipelines might not be the most efficient approach.

  • Monitoring and Management: Data Fusion provides comprehensive monitoring capabilities, which can be beneficial for overseeing scheduled data transfer tasks. Yet, for simple file transfers, other tools might offer simpler management and sufficient monitoring.
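For reference, scheduled runs can also be driven from outside Data Fusion (for example, from Cloud Scheduler on a 15-minute cron) by calling the CDAP REST API that Data Fusion exposes. This is a minimal sketch, not a definitive implementation: the instance endpoint, pipeline name `azure-to-gcs`, and the caller-supplied OAuth token are assumptions, and it assumes a deployed batch pipeline (whose workflow is named DataPipelineWorkflow in CDAP):

```python
# Hedged sketch: start a deployed Data Fusion (CDAP) batch pipeline over REST.
# The API endpoint, pipeline name, and token must be supplied by the caller.
import urllib.request


def pipeline_start_url(api_endpoint: str, pipeline: str,
                       namespace: str = "default") -> str:
    """Build the CDAP REST URL that starts a deployed batch pipeline."""
    return (f"{api_endpoint.rstrip('/')}/v3/namespaces/{namespace}"
            f"/apps/{pipeline}/workflows/DataPipelineWorkflow/start")


def start_pipeline(api_endpoint: str, pipeline: str, token: str) -> int:
    """POST to the start endpoint; returns the HTTP status code."""
    req = urllib.request.Request(
        pipeline_start_url(api_endpoint, pipeline),
        data=b"{}",  # optional runtime arguments as a JSON object
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A Cloud Scheduler HTTP job pointed at this URL with a `*/15 * * * *` schedule would give the 15-minute cadence without keeping anything running in between.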

Investigating Failures

  • Timeout and Resource Limits: Check the timeout settings for your Data Fusion pipeline and connectors, and ensure your network and Data Fusion instance are configured to handle the sizes of files being transferred. Adjustments may be necessary to accommodate larger files or to optimize performance.

  • Network and File Size Considerations: Large file sizes or suboptimal network conditions can impact transfer times. Assess whether these factors are contributing to the failures you're experiencing.

Alternatives for Frequent File Transfers

  • Azure Logic Apps: For precise scheduling and direct file transfers, Azure Logic Apps offers a recurrence trigger that can be set to 15-minute intervals, along with connectors for both Azure Blob Storage and GCS.

  • Event-Driven Transfers with Azure Functions: An Azure Function triggered by new blobs in Azure Blob Storage could initiate transfers to GCS, providing a responsive and potentially more efficient mechanism for file movement.
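As an illustrative sketch of the event-driven option above, a blob-triggered Azure Function could copy each new object into GCS. Everything here is an assumption for illustration: the bucket name, the `from-azure` prefix convention, and the presence of the `azure-functions` and `google-cloud-storage` packages in the function's environment (the trigger binding itself lives in `function.json` and is not shown):

```python
# Hedged sketch of a blob-triggered Azure Function that copies new blobs to GCS.
# Bucket name and prefix are hypothetical; the SDK import is deferred into the
# handler so the pure naming helper can be used (and tested) on its own.

def gcs_object_name(blob_name: str, prefix: str = "from-azure") -> str:
    """Map an Azure blob path to a GCS object name under a fixed prefix."""
    return f"{prefix}/{blob_name.lstrip('/')}"


def main(myblob) -> None:  # bound as func.InputStream via function.json
    # Deferred import: google-cloud-storage (assumed installed)
    from google.cloud import storage

    client = storage.Client()  # uses Application Default Credentials
    bucket = client.bucket("my-target-bucket")  # hypothetical bucket name
    blob = bucket.blob(gcs_object_name(myblob.name))
    blob.upload_from_string(myblob.read())  # writes the blob body into GCS
```

Because the function fires per blob, this moves files as they arrive rather than on a fixed 15-minute schedule, which may suit the use case better than polling.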

While Data Fusion is a powerful tool for data integration and transformation, using it for direct, frequent file transfers without data processing might not be the most efficient or cost-effective choice. Alternatives specifically designed for file synchronization or transfer could provide better-suited solutions for your needs. Always align the tool choice with the operational requirements and constraints of your specific file transfer scenario.

