Hi,
I am trying to modify an Iceberg table created in BigQuery, using Apache Spark running on a Dataproc cluster.
Steps I have taken so far:
Issue:
I am trying to install the Iceberg dependencies on the Dataproc cluster, but I am getting a "module not found" error.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0
Steps taken:
Where I want to go:
Primarily, I want to be able to run the code below, as stated in the official docs.
Let me know if you need any other details. Thanks.
Hi @cassandramae,
I was able to solve this by creating a public NAT gateway and then creating a Dataproc cluster that uses that gateway.
The issue was that Dataproc could not reach the internet to fetch the dependencies; setting up NAT fixed that.
Thanks for reaching out and I have made a note of your solution as well.
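The NAT setup described in this reply might look roughly like the sketch below. It is not the poster's exact configuration: the router name, NAT name, network, and region are hypothetical placeholders, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
set -eu

# Hypothetical names and defaults; replace with your own values.
REGION="${REGION:-us-central1}"
NETWORK="${NETWORK:-default}"
ROUTER="${ROUTER:-dataproc-router}"
NAT="${NAT:-dataproc-nat}"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# A Cloud Router is required before a Cloud NAT gateway can be attached to it.
run gcloud compute routers create "$ROUTER" \
  --network="$NETWORK" --region="$REGION"

# Public NAT for all subnets in the region, with automatically allocated external IPs.
run gcloud compute routers nats create "$NAT" \
  --router="$ROUTER" --region="$REGION" \
  --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges
```

A Dataproc cluster created afterwards in the same network and region should then be able to reach Maven Central when `--packages` is used, even if its nodes have only internal IPs.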
Hi @logan_hewlett,
Welcome to Google Cloud Community!
It looks like you're using Spark SQL with Apache Iceberg. According to the Apache Iceberg documentation, you must add the Iceberg Spark runtime to Spark's jars folder (if you haven't done so already). As an alternative, you may also consider using Iceberg on Dataproc by creating a cluster with the Iceberg optional component. Optionally, you can set engine-specific Iceberg properties using the appropriate prefix (e.g. spark). Just make sure you are running Iceberg on a supported image version.
See the gcloud command below:
gcloud dataproc clusters create \
--region= \
--optional-components=ICEBERG \
--image-version= \
--properties=
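For the other option mentioned in this reply (placing the Iceberg Spark runtime in Spark's jars folder), a minimal sketch follows. It assumes the same Maven coordinates used with `--packages` in the question, the standard Maven Central URL layout, and that SPARK_HOME is /usr/lib/spark (an assumption; check your cluster). The download line is left commented out because it needs internet access and write permission.

```shell
set -eu

# Coordinates taken from the --packages argument in the question.
GROUP_PATH="org/apache/iceberg"
ARTIFACT="iceberg-spark-runtime-3.3_2.12"
VERSION="1.3.0"

# Assumption: Spark install location; override SPARK_HOME if yours differs.
SPARK_HOME="${SPARK_HOME:-/usr/lib/spark}"

# Maven Central layout: <group path>/<artifact>/<version>/<artifact>-<version>.jar
JAR="${ARTIFACT}-${VERSION}.jar"
URL="https://repo1.maven.org/maven2/${GROUP_PATH}/${ARTIFACT}/${VERSION}/${JAR}"

echo "Would download: ${URL}"
echo "Into: ${SPARK_HOME}/jars/${JAR}"

# Uncomment to perform the download:
# curl -fSL "${URL}" -o "${SPARK_HOME}/jars/${JAR}"
```

With the jar in place, `spark-sql` can be started without `--packages`, so nothing needs to be fetched at launch time.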
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.