
Query BigQuery Iceberg tables using Spark

Hi,

I am trying to modify an Iceberg table created in BigQuery using Apache Spark running on a Dataproc cluster.

Steps I have taken so far:

  1. Created an external connection for Iceberg.
  2. Created a GCS bucket to store the metadata and data files for the Iceberg table.
  3. Created an empty table using the Iceberg for BigQuery option, pointing to the bucket created in step 2 (a rough sketch of the DDL is shown after this list).
  4. Created a Dataproc cluster to read the BigQuery Iceberg table using Spark.
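
For reference, step 3 was roughly the following DDL, run here via the bq CLI. The project, region, connection, dataset, table, bucket, and schema below are placeholders, not my actual values:

# Sketch of step 3 -- PROJECT_ID, REGION, CONNECTION_ID, MY_BUCKET,
# dataset/table names and the schema are placeholders.
bq query --use_legacy_sql=false '
CREATE TABLE my_dataset.my_iceberg_table (
  id INT64,
  name STRING
)
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  file_format = "PARQUET",
  table_format = "ICEBERG",
  storage_uri = "gs://MY_BUCKET/iceberg/my_iceberg_table"
);'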

Issue:

I am trying to install the Iceberg dependencies on the Dataproc cluster, but I am getting a module not found error when I run the following command:
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0

 

(Screenshot of the error attached: Screenshot 2025-03-28 at 10.52.52 PM.png)

Steps taken:

  1. I tried tweaking the runtime version several times and also passed the --repositories flag with the Maven Central URL; a sketch of what I ran is below this list.
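
This is roughly one of the variants I ran (the repository URL is Maven Central; the runtime coordinates are just one of the versions I tried):

# One of the attempts -- it still fails with the same error on this cluster.
spark-sql \
  --repositories https://repo1.maven.org/maven2 \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0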

Where I want to get to:

I primarily want to be able to run the code below, as stated in the official docs:

(Screenshot of the code from the official docs attached: Screenshot 2025-03-28 at 10.56.14 PM.png)
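
In case the screenshot doesn't come through, here is roughly the shape of what I'm trying to run. The catalog name, warehouse path, and table identifiers are placeholders of mine and assume an Iceberg catalog pointed at the GCS bucket from step 2, so they may not match the docs exactly:

# Illustrative only -- my_catalog, the warehouse path and the table name are placeholders.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0 \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.type=hadoop \
  --conf spark.sql.catalog.my_catalog.warehouse=gs://MY_BUCKET/iceberg \
  -e "SELECT * FROM my_catalog.my_namespace.my_iceberg_table LIMIT 10;"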

Let me know if you need any other details. Thanks.

Solved

2 REPLIES

Hi @logan_hewlett,

Welcome to Google Cloud Community!

It looks like you're using Spark SQL with Apache Iceberg. According to the Apache Iceberg documentation, you must add the Iceberg Spark runtime to Spark's jars folder (if you haven't already done so). As a workaround, you may also consider using Iceberg on Dataproc by creating a cluster with the Iceberg optional component. You can then set engine-specific Iceberg properties using the appropriate prefix (e.g. spark); just make sure you are running Iceberg on a supported image version.

See the gcloud command template below:

gcloud dataproc clusters create CLUSTER_NAME \
  --region=REGION \
  --optional-components=ICEBERG \
  --image-version=IMAGE_VERSION \
  --properties=PROPERTIES
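
For illustration, a filled-in version might look like the sketch below. The cluster name, region, image version, and catalog settings are placeholder values I've chosen, not values from the docs, so please verify the image version supports the Iceberg component first:

# Hypothetical example values -- adjust to your project and confirm the image
# version supports the ICEBERG optional component.
gcloud dataproc clusters create iceberg-cluster \
  --region=us-central1 \
  --optional-components=ICEBERG \
  --image-version=2.2-debian12 \
  --properties="spark:spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog,spark:spark.sql.catalog.my_catalog.type=hadoop,spark:spark.sql.catalog.my_catalog.warehouse=gs://MY_BUCKET/iceberg"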

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Hi @cassandramae,

I was able to solve this by creating a Cloud NAT gateway and then spinning up the Dataproc cluster on the network that uses it.

The issue was that the Dataproc cluster could not access the internet to fetch the dependencies; setting up Cloud NAT fixed that.
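
For anyone hitting the same thing, this is roughly the setup I used (the router and gateway names, network, and region below are placeholders; your network layout may differ):

# Sketch of the Cloud NAT setup -- names, network and region are placeholders.
# Create a Cloud Router on the cluster's VPC network, then attach a NAT config to it.
gcloud compute routers create nat-router \
  --network=default \
  --region=us-central1
gcloud compute routers nats create nat-config \
  --router=nat-router \
  --region=us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges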

Thanks for reaching out and I have made a note of your solution as well.