Azure Blob source in Data Fusion pipeline

Hi,

I am creating a pipeline in Data Fusion with Azure Blob as the source and a GCS bucket as the target. When I run the pipeline, the flow gets stuck with the warning below and then fails with a timeout error. Please let me know if anyone has faced a similar issue, or guide me on how to resolve it.

 
WARN
Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.HftpFileSystem not found


Solved


5 REPLIES

The warnings about missing Hadoop file system providers (org.apache.hadoop.hdfs.web.HftpFileSystem and org.apache.hadoop.hdfs.web.HsftpFileSystem) are intriguing, as they are not directly involved in operations between Azure Blob and GCS. Their absence, however, hints at underlying configuration or dependency issues that could indirectly affect your pipeline's functionality.

Troubleshooting Steps

Verify Azure Blob Connector:

  • Ensure the Azure Blob Storage connector in your Datafusion pipeline is correctly configured (a quick standalone credential check is sketched after this list):

    • Authentication Type: Confirm you're using the correct authentication method (e.g., Shared Key, SAS Token).

    • Credentials: Verify the accuracy of the account name and access keys or SAS token.

    • Endpoint: Check that the Azure Blob Storage endpoint is correctly specified.
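
To rule out credential problems independently of the pipeline, a minimal sketch of such a check is below, assuming the azure-storage-blob Python package and a Shared Key credential; the account, key, and container names are placeholders:

# check_azure_access.py -- verify the same credentials the pipeline uses,
# outside Data Fusion, to rule out authentication or endpoint problems.
from azure.storage.blob import BlobServiceClient

ACCOUNT = "mystorageaccount"         # placeholder: storage account name
ACCOUNT_KEY = "<shared-access-key>"  # placeholder: shared key or SAS token
CONTAINER = "source-container"       # placeholder: container the pipeline reads

service = BlobServiceClient(
    account_url=f"https://{ACCOUNT}.blob.core.windows.net",
    credential=ACCOUNT_KEY,
)

# Listing a few blobs confirms the account name, credential, and endpoint all line up.
container = service.get_container_client(CONTAINER)
for blob in list(container.list_blobs())[:5]:
    print(blob.name, blob.size)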

Check GCS Connector:

  • Confirm the setup of your GCS Bucket connector is accurate and that you have the necessary permissions on the GCS bucket.
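
One way to confirm those permissions outside the pipeline is sketched below, assuming the google-cloud-storage Python client running as the same service account via Application Default Credentials; the bucket name is a placeholder:

# check_gcs_access.py -- confirm the identity the pipeline runs as can actually
# write to the destination bucket before digging further into the pipeline.
from google.cloud import storage

BUCKET = "my-destination-bucket"  # placeholder: destination bucket name

bucket = storage.Client().bucket(BUCKET)

# test_iam_permissions returns the subset of the requested permissions
# that the caller actually holds on the bucket.
needed = ["storage.objects.create", "storage.objects.get", "storage.objects.list"]
granted = bucket.test_iam_permissions(needed)
print("granted:", granted)
print("missing:", sorted(set(needed) - set(granted)))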

Address Hadoop Dependencies:

  • The HDFS-related warnings, while possibly benign, should be resolved to avoid potential conflicts. Ensure your Datafusion instance includes the necessary Hadoop libraries for Azure Blob (hadoop-azure) and GCS (gcs-connector). These libraries might need to be added to your pipeline or instance configuration manually.
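
For reference, the Hadoop property keys those two libraries conventionally rely on are collected below as a plain Python dict, purely for illustration; the account name and key are placeholders, and how (or whether) these need to be supplied to your Datafusion instance or compute profile depends on your setup:

# Illustrative only: standard Hadoop property keys used by hadoop-azure (wasb://)
# and the GCS connector (gs://). Values are placeholders, not working credentials.
ACCOUNT = "mystorageaccount"  # placeholder: Azure storage account name

hadoop_properties = {
    # hadoop-azure: shared key granting wasb:// access to the account
    f"fs.azure.account.key.{ACCOUNT}.blob.core.windows.net": "<shared-access-key>",
    # gcs-connector: filesystem implementations and service-account auth
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    "google.cloud.auth.service.account.enable": "true",
}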

Network Considerations:

  • Review any firewall or network configuration that might be blocking connectivity between your Datafusion instance, Azure Blob Storage, and GCS. Ensure the necessary ports and protocols are open, and check for any required proxy settings in private or restricted network environments.

Timeout Configuration:

  • Examine and adjust the timeout settings in your pipeline and connectors, especially if large files or network conditions are causing delays. Specific settings for Azure Blob and GCS configurations should be reviewed.

Logging and Monitoring:

  • Increase the logging level for more detailed insights. This can often be done through the Datafusion UI or by integrating with Google Cloud Logging (Stackdriver) for comprehensive logging and monitoring.
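
A rough sketch of pulling recent entries programmatically is below, assuming the google-cloud-logging Python client; the filter string is only a placeholder and should be adapted to however your Datafusion pipeline logs appear in Cloud Logging:

# pull_pipeline_logs.py -- fetch recent warning-or-worse entries for inspection.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()

# Placeholder filter: narrow it to your project, Datafusion instance, and run.
log_filter = 'severity>=WARNING AND textPayload:"Cannot load filesystem"'

for entry in client.list_entries(filter_=log_filter,
                                 order_by=cloud_logging.DESCENDING,
                                 max_results=20):
    print(entry.timestamp, entry.payload)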

Additional Tips

  • Compatibility between the Datafusion version, Hadoop libraries, and connectors (Azure Blob and GCS) is crucial. Ensure all components are compatible to avoid subtle issues.

  • For complex troubleshooting, consider enabling DEBUG or TRACE level logging for Hadoop components, but be mindful of the potential for large volumes of logs.

  • Verify that the Datafusion service account (or the account running the pipeline) has the necessary IAM roles and permissions for both Azure Blob Storage and GCS, such as Storage Object Creator, Storage Object Viewer, and Storage Object Admin roles for GCS, with equivalent permissions for Azure Blob.

Thanks for the reply. I just wanted to see if there is a way to copy the file itself, rather than copying the contents inside the file, from Azure Blob to a GCS bucket using Data Fusion. I mean a direct move or copy of the file.

While Datafusion's core strength lies in data integration and transformation, it can be used to effectively move data between Azure Blob Storage and GCS. Here's a breakdown of the method and essential things to consider:

Method

  1. Configure Connectors: Establish connections to Azure Blob Storage (source) and GCS Bucket (sink) within your Datafusion pipeline. Provide all necessary authentication details.
  2. Simple Transfer: If you just want to copy files without modifications, you don't need any intermediate transformations in the pipeline.

Important Considerations

  • Permissions: Your Datafusion instance must have read permissions on Azure Blob Storage and write permissions on the GCS bucket.
  • File Overwrites: Decide if you want to overwrite existing files in your GCS bucket or create unique names to avoid conflicts.
  • Delete on Copy (Move Behavior): If you want to delete files from Azure Blob Storage after they're copied, you'll need a separate process or custom code (a rough sketch follows the example pipeline structure below). Datafusion doesn't natively support this in a single pipeline.

Example Pipeline Structure

[Azure Blob Source] --> [GCS Bucket Sink]

Or, with a post-copy cleanup step:

[Azure Blob Source] --> [GCS Bucket Sink] --> [Custom Delete Operation]

Datafusion excels at handling data streams between systems. It reads from Azure Blob and writes to GCS. Keep in mind that this may not preserve all original file metadata, and it's fundamentally different from a direct filesystem copy.
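
If you do end up needing the Custom Delete Operation from the second structure above, here is a rough sketch of that move behavior outside Datafusion, assuming the azure-storage-blob and google-cloud-storage Python packages; all names are placeholders, and each source blob is deleted only after the copy is confirmed in GCS:

# move_blobs.py -- copy every blob from an Azure container to a GCS bucket,
# then delete the source blob once the copy is confirmed. Blobs are buffered
# in memory here for simplicity; stream them for very large files.
from azure.storage.blob import BlobServiceClient
from google.cloud import storage

AZURE_ACCOUNT = "mystorageaccount"    # placeholder
AZURE_KEY = "<shared-access-key>"     # placeholder
AZURE_CONTAINER = "source-container"  # placeholder
GCS_BUCKET = "my-destination-bucket"  # placeholder

azure = BlobServiceClient(
    account_url=f"https://{AZURE_ACCOUNT}.blob.core.windows.net",
    credential=AZURE_KEY,
)
container = azure.get_container_client(AZURE_CONTAINER)
bucket = storage.Client().bucket(GCS_BUCKET)

for props in container.list_blobs():
    data = container.download_blob(props.name).readall()
    bucket.blob(props.name).upload_from_string(data)  # copy bytes into GCS
    if bucket.blob(props.name).exists():               # verify before deleting source
        container.delete_blob(props.name)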

Enhanced Perspective

If your goal is purely file transfer with no transformations, consider:

  • Dedicated File Transfer Services: Like Google Storage Transfer Service.
  • Command-line Tools: gsutil (GCS) or azcopy (Azure).

These often provide simpler workflows, optimized file transfer mechanisms, and may be more cost-effective – especially when dealing with large amounts of data or when preserving file metadata is crucial.

I am trying to transfer the files as they are, not read the files and transfer the data.

So my pipeline is Azure Blob ---> GCS,

but it is failing after running for a long time. The reason I am choosing this approach is that we have a requirement to run this transfer every 15 minutes from the source to the GCS bucket, and Data Transfer, Storage Transfer, and AzCopy only offer hourly or daily scheduling.

Please correct me if I am wrong

While Google Cloud Datafusion offers robust capabilities for data integration and transformation, it's important to accurately assess its suitability for your specific use case of transferring files between Azure Blob and GCS at 15-minute intervals. Here's a refined perspective on using Datafusion for this task and some insights into potential challenges:

Suitability of Datafusion

  • Scheduled Transfers: Datafusion supports scheduling, which can facilitate regular data transfer jobs. However, for very frequent, lightweight file transfer tasks, alternative tools might be more efficient.

  • Data Processing vs. File Transfer: Datafusion excels in scenarios where data needs to be processed or transformed during transfer. If your requirement is to move files as-is, without processing, the overhead of Datafusion pipelines might not be the most efficient approach.

  • Monitoring and Management: Datafusion provides comprehensive monitoring capabilities, which can be beneficial for overseeing scheduled data transfer tasks. Yet, for simple file transfers, other tools might offer simpler management and sufficient monitoring.

Investigating Failures

  • Timeout and Resource Limits: Check the timeout settings for your Datafusion pipeline and connectors, and ensure your network and Datafusion instance are configured to handle the sizes of files being transferred. Adjustments may be necessary to accommodate larger files or to optimize performance.

  • Network and File Size Considerations: Large file sizes or suboptimal network conditions can impact transfer times. Assess whether these factors are contributing to the failures you're experiencing.

Alternatives for Frequent File Transfers

  • Azure Logic Apps: For precise scheduling and direct file transfers, Azure Logic Apps offers a recurrence trigger that can be set to 15-minute intervals, along with connectors for both Azure Blob Storage and GCS.

  • Event-Driven Transfers with Azure Functions: An Azure Function triggered by new blobs in Azure Blob Storage could initiate transfers to GCS, providing a responsive and potentially more efficient mechanism for file movement.
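
As a hedged illustration of that event-driven pattern, assuming the Azure Functions Python v2 programming model plus the google-cloud-storage client, with placeholder names throughout and GCS credentials made available to the Function App separately:

# function_app.py -- fires when a new blob lands in the Azure container and
# copies its contents into the GCS bucket. Container name, connection setting,
# and bucket name are placeholders.
import azure.functions as func
from google.cloud import storage

GCS_BUCKET = "my-destination-bucket"  # placeholder

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="source-container/{name}",    # placeholder container
                  connection="AzureWebJobsStorage")  # app setting holding the storage connection string
def copy_to_gcs(blob: func.InputStream):
    # blob.name may include the container prefix; adjust the destination key if needed.
    target = storage.Client().bucket(GCS_BUCKET).blob(blob.name)
    target.upload_from_string(blob.read())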

While Datafusion is a powerful tool for data integration and transformation, its use for direct, frequent file transfers without data processing might not be the most efficient or cost-effective choice. Considering alternatives specifically designed for file synchronization or transfer could provide better-suited solutions for your needs. Always align the tool choice with the operational requirements and constraints of your specific file transfer scenario.