Re: Apache Hudi or Dataflow

ravifanatic · 09-21-2021 07:48 AM

Which solution is better for batch workloads, Apache Hudi or Dataflow, considering a daily volume of 20-25 GB? My preference is Dataflow, as it's a serverless service and supports both batch and streaming workloads. But for complex joins or significant data crunching, is Apache Hudi the best option?

economize-cloud

Dataflow will be able to handle the volume you describe. It can unify your streaming and batch workloads, making it easy to migrate and reuse code from batch to streaming.
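To illustrate the code reuse point, here is a minimal sketch using the Apache Beam Python SDK, which is what Dataflow runs. The `customer_id,amount` line format, the function names, and the paths are assumptions made up for this example, not anything from your setup. The idea is that the transform logic lives in a function that does not care where the lines come from, so only the read and write steps would change for streaming.

```python
def parse_line(line):
    """Parse a hypothetical 'customer_id,amount' CSV line into a (key, value) pair."""
    customer_id, amount = line.split(",")
    return customer_id.strip(), float(amount)


def totals_per_customer(lines):
    """Source-agnostic transform: takes a PCollection of text lines."""
    # Imported lazily; requires `pip install apache-beam`.
    import apache_beam as beam

    return (
        lines
        | "Parse" >> beam.Map(parse_line)
        | "SumPerCustomer" >> beam.CombinePerKey(sum)
    )


def run_batch(input_path, output_prefix):
    """Batch entry point: reads text files, writes 'customer_id,total' lines."""
    import apache_beam as beam

    # With no options this uses the local DirectRunner; running on Dataflow
    # needs pipeline options (runner, project, region, temp location).
    with beam.Pipeline() as pipeline:
        lines = pipeline | "Read" >> beam.io.ReadFromText(input_path)
        totals = totals_per_customer(lines)
        (
            totals
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
            | "Write" >> beam.io.WriteToText(output_prefix)
        )
```

A streaming version would keep `totals_per_customer` and swap the read step for an unbounded source such as Pub/Sub, adding a windowing step before the aggregation.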

When using Hudi, you will need to choose a separate tool for data processing, since Hudi manages table storage rather than running the processing itself. You should also check other Google Cloud Platform tools such as Cloud Data Fusion and Dataproc.

shiva_iyer

What are your sources? Is the data going to be transformed and loaded into BigQuery (BQ)?

If you want to go cloud-native, then Dataflow could be an option. Databricks on GCP with Delta Lake is another option you could consider. But a lot will depend on what sources you are extracting data from.