py4j.protocol.Py4JJavaError: An error occurred while calling o764.save.
: com.google.cloud.bigquery.connector.common.BigQueryConnectorException: unexpected issue trying to save [period_start: date, period_end: date ... 20 more fields]
...
Caused by: org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase optimization failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace.
...
Caused by: java.lang.NullPointerException
at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
(
df.write.format("bigquery")
.partitionBy("period_start")
.option("writeMethod", "direct")
.option("table", bigquery_destination_table)
.option("createDisposition", "CREATE_IF_NEEDED")
.mode("overwrite")
.save()
)
I initialized the spark session with:
spark = (
    SparkSession.builder.appName("my_app")
    .config(
        "spark.jars",
        ",".join([
            "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar",
            "https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.36.1.jar",
        ]),
    )
    .getOrCreate()
)
spark.conf.set("temporaryGcsBucket", "my-bucket")
spark.conf.set("viewsEnabled", "true")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")
spark.sparkContext.setLogLevel("WARN")
Hi @lorenh,
Welcome to Google Cloud Community!
The “You hit a bug in Spark or the Spark plugins you use” message indicates a failure deep inside Spark's SQL optimization phase while interacting with the BigQuery connector.
The pipeline may be failing because the connector jar is incompatible with your Dataproc version. I suggest checking the jar you are loading and testing which connector release is compatible with your Dataproc image.
For a deeper investigation, you can reach out to Google Cloud Support.
I hope the above information is helpful.
Thanks! I reached out via the GitHub repo for the Spark BigQuery connector, and they pointed me to the latest version, 0.41.0. That, plus a few code changes on my end, got it working.
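For reference, a minimal sketch of the session setup with the connector bumped to 0.41.0. The jar URL is an assumption that the 0.41.0 build follows the same public GCS naming pattern as the 0.36.1 jar used above; verify the exact URL against the connector's release notes before relying on it:

```python
from pyspark.sql import SparkSession

# Assumed URL, following the naming pattern of the 0.36.1 jar in the
# original post; 0.41.0 is the release the maintainers recommended here.
jars = ",".join([
    "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar",
    "https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.41.0.jar",
])

spark = (
    SparkSession.builder.appName("my_app")
    .config("spark.jars", jars)
    .getOrCreate()
)
```

Note that with writeMethod set to "direct", writes go through the BigQuery Storage Write API rather than staging through GCS, so the temporaryGcsBucket setting is only needed for the indirect write path.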