
PySpark stored procedure in BigQuery throws a syntax error

Hi,

I'm trying to create a stored procedure using inline PySpark code like this: 

CREATE OR REPLACE PROCEDURE my_bq_project.my_dataset.spark_proc()
WITH CONNECTION `my-project-id.us.my-connection`
OPTIONS (engine="SPARK", runtime_version="1.1")
LANGUAGE PYTHON AS R"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-bigquery-demo").getOrCreate()

# Load data from BigQuery.
words = spark.read.format("bigquery") \
  .option("table", "bigquery-public-data:samples.shakespeare") \
  .load()
words.createOrReplaceTempView("words")

# Perform word count.
word_count = words.select('word', 'word_count').groupBy('word').sum('word_count').withColumnRenamed("sum(word_count)", "sum_word_count")
word_count.show()
word_count.printSchema()

# Saving the data to BigQuery
word_count.write.format("bigquery") \
  .option("writeMethod", "direct") \
  .save("wordcount_dataset.wordcount_output")
"""

But it throws an error:

Syntax error: Expected end of input but got keyword FROM at [8:1]
Not sure why, any ideas?
Solved
2 ACCEPTED SOLUTIONS



3 REPLIES

The syntax error you're encountering most likely comes from how the PySpark code block is introduced. You've opened it with R""" (in GoogleSQL the R prefix marks a raw string literal, not the R language), so try dropping the prefix and writing the block as LANGUAGE PYTHON AS """ instead.

Here's the revised version of your stored procedure with the corrected syntax:

 
CREATE OR REPLACE PROCEDURE my_bq_project.my_dataset.spark_proc()
WITH CONNECTION `my-project-id.us.my-connection`
OPTIONS(engine="SPARK", runtime_version="1.1")
LANGUAGE PYTHON AS """
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-bigquery-demo").getOrCreate()

# Load data from BigQuery.
words = spark.read.format("bigquery") \
  .option("table", "bigquery-public-data:samples.shakespeare") \
  .load()
words.createOrReplaceTempView("words")

# Perform word count.
word_count = words.select('word', 'word_count').groupBy('word').sum('word_count').withColumnRenamed("sum(word_count)", "sum_word_count")
word_count.show()
word_count.printSchema()

# Saving the data to BigQuery
word_count.write.format("bigquery") \
  .option("writeMethod", "direct") \
  .save("wordcount_dataset.wordcount_output")
"""

This revised code explicitly identifies the language as Python for the stored procedure in BigQuery and should resolve the syntax error you were encountering. Remember to verify that other aspects of the stored procedure, like connection and options, are properly configured for your specific scenario.
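For readers less familiar with the PySpark calls above: the groupBy plus sum step is just a per-word aggregation. The sketch below reproduces it in plain Python with made-up sample rows (illustrative only, not real query output):

```python
# A plain-Python sketch of the aggregation the PySpark code performs:
# group rows by "word" and sum their "word_count" values.
# The sample rows are hypothetical, not actual Shakespeare data.
from collections import defaultdict

rows = [
    {"word": "the", "word_count": 3},
    {"word": "king", "word_count": 2},
    {"word": "the", "word_count": 5},
]

sum_word_count = defaultdict(int)
for row in rows:
    sum_word_count[row["word"]] += row["word_count"]

print(dict(sum_word_count))  # {'the': 8, 'king': 2}
```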

Hello and thank you for the reply.

I tried using the code you provided, but I am encountering the same error as before.

 

CREATE OR REPLACE PROCEDURE my_bq_project.my_dataset.spark_proc()
WITH CONNECTION `my-project-id.us.my-connection`
OPTIONS(engine="SPARK", runtime_version="1.1")
LANGUAGE PYTHON AS """
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-bigquery-demo").getOrCreate()

# Load data from BigQuery.
words = spark.read.format("bigquery") \
  .option("table", "bigquery-public-data:samples.shakespeare") \
  .load()
words.createOrReplaceTempView("words")

# Perform word count.
word_count = words.select('word', 'word_count').groupBy('word').sum('word_count').withColumnRenamed("sum(word_count)", "sum_word_count")
word_count.show()
word_count.printSchema()

# Saving the data to BigQuery
word_count.write.format("bigquery") \
  .option("writeMethod", "direct") \
  .save("wordcount_dataset.wordcount_output")
"""

 

Error:

Syntax error: Expected end of input but got keyword FROM at [8:1]

Thanks!
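One additional, purely hypothetical cause worth ruling out: if the Python body itself ever contains three double quotes in a row, the enclosing SQL string literal closes early and everything after that point (starting with the from import line) would be parsed as SQL, which could surface as this kind of error. A quick local check in plain Python:

```python
# Hypothetical sanity check, not a confirmed cause in this thread: detect
# whether a Python body contains the three-double-quote sequence that
# would terminate the surrounding SQL triple-quoted literal early.
TERMINATOR = '"' * 3  # three double quotes in a row

def closes_sql_literal_early(body: str) -> bool:
    # True if embedding this body inside LANGUAGE PYTHON AS """...""" would
    # close the SQL string literal before the intended end of the body.
    return TERMINATOR in body

safe_body = "from pyspark.sql import SparkSession\n"
risky_body = "doc = " + TERMINATOR + "inner docstring" + TERMINATOR + "\n"

print(closes_sql_literal_early(safe_body))   # False
print(closes_sql_literal_early(risky_body))  # True
```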

This is unusual because the LANGUAGE PYTHON AS """ syntax should correctly encapsulate the Python code, preventing it from being parsed as SQL.

Here are a few additional steps to troubleshoot this issue:

  1. Check for Hidden Characters: Sometimes, hidden or non-printable characters can cause syntax errors. Ensure that your code does not contain any such characters, especially around the lines where the error is reported.

  2. Verify BigQuery Support for PySpark: Stored procedures for Apache Spark are a newer addition to BigQuery than SQL and JavaScript procedures. Confirm that the feature is available in your project and region, and that the Spark connection and required APIs are set up correctly.

  3. Procedure Syntax Review: Double-check the syntax for creating stored procedures in BigQuery, especially for PySpark. There might be specific requirements or limitations for using PySpark within BigQuery stored procedures.

  4. Simplify the Code: Try simplifying your PySpark code or using a very basic Python script initially to test if the issue is with the PySpark code or the way it's embedded in the SQL procedure.

  5. Contact Google Cloud Support: If the problem persists and you're unable to find a solution through documentation or forums, consider reaching out to Google Cloud Support for more specialized assistance.
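Step 1 above can be partially automated. Below is a minimal sketch of a hidden-character scanner in plain Python; the character list and the sample string are illustrative assumptions, not an exhaustive set:

```python
# Minimal sketch: scan a DDL script for hidden characters (zero-width
# spaces, non-breaking spaces, smart quotes) that commonly sneak in when
# copying code from web pages and can break SQL parsing.
SUSPECTS = {
    "\u200b": "zero-width space",
    "\u00a0": "non-breaking space",
    "\u201c": "left smart quote",
    "\u201d": "right smart quote",
}

def find_hidden_chars(text):
    # Return (line, column, description) for each suspicious character.
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                hits.append((lineno, col, SUSPECTS[ch]))
    return hits

# A non-breaking space hides between AS and the opening quotes here.
sample = 'LANGUAGE PYTHON AS\u00a0"""\nprint("hi")\n"""'
print(find_hidden_chars(sample))  # [(1, 19, 'non-breaking space')]
```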