
Query BigQuery Iceberg tables using Spark

Hi,

I am trying to modify an Iceberg table created in BigQuery using Apache Spark running on a Dataproc cluster.

Steps I have taken so far:

  1. Created an external connection for Iceberg.
  2. Created a GCS bucket to store the metadata and data files for the Iceberg table.
  3. Created an empty table using the Iceberg for BigQuery option, pointing to the bucket created in step 2 (a rough sketch of the DDL is shown after this list).
  4. Created a Dataproc cluster to read the BigQuery Iceberg table using Spark.
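
For reference, step 3 was roughly the following DDL, run here via the bq CLI. The project, region, connection, dataset, table, bucket, and schema below are placeholders, not my actual values:

# Sketch of step 3 -- PROJECT_ID, REGION, CONNECTION_ID, MY_BUCKET,
# dataset/table names and the schema are placeholders.
bq query --use_legacy_sql=false '
CREATE TABLE my_dataset.my_iceberg_table (
  id INT64,
  name STRING
)
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  file_format = "PARQUET",
  table_format = "ICEBERG",
  storage_uri = "gs://MY_BUCKET/iceberg/my_iceberg_table"
);'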

Issue:

I am trying to install the Iceberg dependencies on the Dataproc cluster, but I am getting a module not found error when I run the following command:
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0

 

(Screenshot of the error attached: Screenshot 2025-03-28 at 10.52.52 PM.png)

Steps taken:

  1. I tried tweaking the runtime version several times and also passed the --repositories flag with the Maven Central URL; a sketch of what I ran is below this list.
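
This is roughly one of the variants I ran (the repository URL is Maven Central; the runtime coordinates are just one of the versions I tried):

# One of the attempts -- it still fails with the same error on this cluster.
spark-sql \
  --repositories https://repo1.maven.org/maven2 \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0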

Where I want to get to:

I primarily want to be able to run the code below, as stated in the official docs:

(Screenshot of the code from the official docs attached: Screenshot 2025-03-28 at 10.56.14 PM.png)
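
In case the screenshot doesn't come through, here is roughly the shape of what I'm trying to run. The catalog name, warehouse path, and table identifiers are placeholders of mine and assume an Iceberg catalog pointed at the GCS bucket from step 2, so they may not match the docs exactly:

# Illustrative only -- my_catalog, the warehouse path and the table name are placeholders.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0 \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.type=hadoop \
  --conf spark.sql.catalog.my_catalog.warehouse=gs://MY_BUCKET/iceberg \
  -e "SELECT * FROM my_catalog.my_namespace.my_iceberg_table LIMIT 10;"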

Let me know if you need any other details. Thanks.

Solved

2 REPLIES

Hi @logan_hewlett,

Welcome to Google Cloud Community!

It looks like you're using Spark SQL with Apache Iceberg. According to the Apache Iceberg documentation, you must add the Iceberg Spark runtime to Spark's jars folder (if you haven't already done so). As a workaround, you may also consider using Iceberg on Dataproc by creating a cluster with the Iceberg optional component. You can then set engine-specific Iceberg properties using the appropriate prefix (e.g. spark); just make sure you are running Iceberg on a supported image version.

See the gcloud command template below:

gcloud dataproc clusters create CLUSTER_NAME \
  --region=REGION \
  --optional-components=ICEBERG \
  --image-version=IMAGE_VERSION \
  --properties=PROPERTIES
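
For illustration, a filled-in version might look like the sketch below. The cluster name, region, image version, and catalog settings are placeholder values I've chosen, not values from the docs, so please verify the image version supports the Iceberg component first:

# Hypothetical example values -- adjust to your project and confirm the image
# version supports the ICEBERG optional component.
gcloud dataproc clusters create iceberg-cluster \
  --region=us-central1 \
  --optional-components=ICEBERG \
  --image-version=2.2-debian12 \
  --properties="spark:spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog,spark:spark.sql.catalog.my_catalog.type=hadoop,spark:spark.sql.catalog.my_catalog.warehouse=gs://MY_BUCKET/iceberg"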

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Hi @cassandramae,

I was able to solve this by creating a Cloud NAT gateway and then spinning up the Dataproc cluster on the network that uses it.

The issue was that the Dataproc cluster could not access the internet to fetch the dependencies; setting up Cloud NAT fixed that.
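
For anyone hitting the same thing, this is roughly the setup I used (the router and gateway names, network, and region below are placeholders; your network layout may differ):

# Sketch of the Cloud NAT setup -- names, network and region are placeholders.
# Create a Cloud Router on the cluster's VPC network, then attach a NAT config to it.
gcloud compute routers create nat-router \
  --network=default \
  --region=us-central1
gcloud compute routers nats create nat-config \
  --router=nat-router \
  --region=us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges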

Thanks for reaching out and I have made a note of your solution as well.