
Internal Error when writing PySpark to BigQuery on Dataproc

I've been getting an internal error from Spark, caused by a `java.lang.NullPointerException`, when I attempt to write from PySpark to BigQuery on Dataproc. I'm using the spark-bigquery connector and writing to an existing, partitioned table. The error only seems to occur when the table I'm writing is the result of a join, as in the sketch below.
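For illustration, here is a minimal sketch of the failing pattern; the table and column names are placeholders, not my real schema:

# Placeholder tables: each one reads (and writes back) fine on its own.
periods = spark.read.format("bigquery").option("table", "my_project.my_dataset.periods").load()
events = spark.read.format("bigquery").option("table", "my_project.my_dataset.events").load()

# Writing the joined result (the `df` in the failing call below) is what triggers the NPE.
df = periods.join(events, on="period_id", how="inner")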
 
Error Message

py4j.protocol.Py4JJavaError: An error occurred while calling o764.save.
: com.google.cloud.bigquery.connector.common.BigQueryConnectorException: unexpected issue trying to save [period_start: date, period_end: date ... 20 more fields]
...
Caused by: org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase optimization failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace.
...
Caused by: java.lang.NullPointerException
at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)

This is the failing call:

(
    df.write.format("bigquery")
    .partitionBy("period_start")
    .option("writeMethod", "direct")
    .option("table", bigquery_destination_table)
    .option("createDisposition", "CREATE_IF_NEEDED")
    .mode("overwrite")
    .save()
)
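For what it's worth, one variant I can still try is the indirect write method, where the connector stages files in a temporary GCS bucket and runs a BigQuery load job instead of using the Storage Write API (the bucket name below is a placeholder):

(
    df.write.format("bigquery")
    .partitionBy("period_start")
    # Indirect write: stage files in GCS, then load them into BigQuery.
    .option("writeMethod", "indirect")
    .option("temporaryGcsBucket", "my-bucket")  # placeholder staging bucket
    .option("table", bigquery_destination_table)
    .option("createDisposition", "CREATE_IF_NEEDED")
    .mode("overwrite")
    .save()
)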

I initialized the Spark session with:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("my_app")
    .config(
        "spark.jars",
        ",".join(
            [
                "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar",
                "https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.36.1.jar",
            ]
        ),
    )
    .getOrCreate()
)
spark.conf.set("temporaryGcsBucket", "my-bucket")
spark.conf.set("viewsEnabled", "true")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")
spark.sparkContext.setLogLevel("WARN")
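As a rough sanity check on version clashes (Dataproc images bundle their own BigQuery connector), I can print the jars the session was configured with; note this only reflects what was set via spark.jars, not everything on the classpath:

# Rough check: print the connector jars this session was configured with.
print(spark.sparkContext.getConf().get("spark.jars", ""))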

How do I get around this error to write the table to BigQuery?

The full stack trace includes a double dollar sign, which the forum does not allow, so I am unable to post it.
1 ACCEPTED SOLUTION

Hi @lorenh,

Welcome to Google Cloud Community!

The "You hit a bug in Spark or the Spark plugins you use" message indicates a problem deep within Spark's SQL optimization phase when it interacts with the BigQuery connector.

The pipeline may be failing because of an incompatibility with your Dataproc image version. I suggest checking the connector JAR you are using and testing a release that is compatible with your Dataproc version.

For a deeper investigation, you can reach out to Google Cloud Support.

I hope the above information is helpful.



Thanks! I reached out via the GitHub repo for the spark-bigquery-connector, and they pointed me to the latest version, 0.41.0. That, plus a few code changes on my end, got it working.
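For anyone else hitting this, a minimal sketch of my updated session init with the connector pinned to 0.41.0; I'm assuming the versioned jar follows the same spark-lib path pattern as the 0.36.1 jar above:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("my_app")
    .config(
        "spark.jars",
        ",".join(
            [
                "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar",
                # Pin the connector to 0.41.0, the version that fixed the NPE for me.
                "https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.41.0.jar",
            ]
        ),
    )
    .getOrCreate()
)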