I am trying to use the Apache Hudi component on a Dataproc cluster.
I ran the example code provided by Google (https://cloud.google.com/dataproc/docs/concepts/components/hudi), but it doesn't work.
When I run the Spark query with Hudi, I get the following error:
java.lang.ClassNotFoundException:
Failed to find data source: hudi. Please find packages at
https://spark.apache.org/third-party-projects.html
Also, according to the documentation, the executable script should be located at the path below:
/usr/lib/hudi/cli
But that path doesn't exist.
Below is the cluster creation command I used to enable the Hudi component:
gcloud dataproc clusters create hudi-poc \
--enable-component-gateway --master-machine-type n2-standard-2 \
--master-boot-disk-size 200 --num-workers 2 \
--worker-machine-type e2-standard-2 --worker-boot-disk-size 100 \
--image-version 2.1.2-ubuntu20 --region us-central1 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--optional-components HUDI
Has anyone had success using the Hudi component on a Dataproc cluster?
Hi,
Try adding this property to your cluster creation command:
--properties spark:spark.jars.packages="org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0" \
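For reference, here is a rough sketch of how that property might fit into the cluster creation command from the question. Everything except the --properties line is copied from the original command. The RUN switch and the HUDI_PACKAGE variable are hypothetical names added for this sketch, so the script only prints the command unless RUN=1 is set. It also assumes the Spark 3.3 / Scala 2.12 bundle is the right match for this image version, which I have not verified.

```shell
#!/usr/bin/env bash
# Sketch only: combines the cluster settings from the question with the
# suggested spark.jars.packages property. Nothing is created unless RUN=1.
set -euo pipefail

# Assumption: this bundle matches the Spark/Scala versions of the chosen image.
HUDI_PACKAGE="org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0"

cmd=(
  gcloud dataproc clusters create hudi-poc
  --enable-component-gateway
  --master-machine-type n2-standard-2
  --master-boot-disk-size 200
  --num-workers 2
  --worker-machine-type e2-standard-2
  --worker-boot-disk-size 100
  --image-version 2.1.2-ubuntu20
  --region us-central1
  --scopes https://www.googleapis.com/auth/cloud-platform
  --optional-components HUDI
  --properties "spark:spark.jars.packages=${HUDI_PACKAGE}"
)

# Print the full command for review.
printf '%s ' "${cmd[@]}"; echo

# RUN is a hypothetical switch for this sketch; set RUN=1 to really create the cluster.
if [[ "${RUN:-0}" == "1" ]]; then
  "${cmd[@]}"
fi
```

If the "Failed to find data source: hudi" error still shows up after recreating the cluster, it may be worth checking that the bundle version lines up with the Spark version on the image.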