We have just created a Dataproc cluster on GCE (image 2.2.39-debian12) with the Zeppelin component. We are trying to create a table in Spark SQL from data stored in Avro format by running the following block in Zeppelin:
%spark.sql
CREATE TABLE daily_stats USING AVRO OPTIONS (path "gs://bucket/path/to/data/*.avro");
After a couple of interpreter failures, I tried the spark-sql CLI directly on the VM:
spark-sql --packages org.apache.spark:spark-avro_2.12:3.5.0
It seems spark-avro_2.12 had never been downloaded, because the download only started at that point, even though I had made sure to add it to spark.jar.packages in the Spark interpreter config through Zeppelin's UI.
I managed to create the table and run some queries without any issues, and the same worked in a Zeppelin notebook once the package had been downloaded in my CLI session. However, I noticed that running a query that returns a couple of thousand rows results in the following loop of errors, with Zeppelin getting stuck:
INFO [2024-11-11 18:18:53,498] ({JobStatusPoller-paragraph_1731338412819_1362046418} NotebookServer.java[onStatusChange]:1987) - Job paragraph_1731338412819_1362046418 starts to RUNNING
INFO [2024-11-11 18:19:27,168] ({qtp524223214-19} NotebookServer.java[onClose]:472) - Closed connection to 127.0.0.1:39186 (1009) Text message size [1048533] exceeds maximum size [1024000]
INFO [2024-11-11 18:19:33,866] ({qtp524223214-19} NotebookServer.java[onOpen]:244) - New connection from 127.0.0.1:51090
INFO [2024-11-11 18:19:35,467] ({qtp524223214-13} NotebookServer.java[onClose]:472) - Closed connection to 127.0.0.1:51090 (1011) EofException
INFO [2024-11-11 18:19:41,757] ({qtp524223214-14} NotebookServer.java[onOpen]:244) - New connection from 127.0.0.1:51102
INFO [2024-11-11 18:19:42,689] ({qtp524223214-17} NotebookServer.java[onClose]:472) - Closed connection to 127.0.0.1:51102 (1011) EofException
INFO [2024-11-11 18:19:48,680] ({qtp524223214-17} NotebookServer.java[onOpen]:244) - New connection from 127.0.0.1:56664
INFO [2024-11-11 18:19:49,621] ({qtp524223214-19} NotebookServer.java[onClose]:472) - Closed connection to 127.0.0.1:56664 (1011) EofException
INFO [2024-11-11 18:19:55,783] ({qtp524223214-14} NotebookServer.java[onOpen]:244) - New connection from 127.0.0.1:52348
INFO [2024-11-11 18:19:56,773] ({qtp524223214-16} NotebookServer.java[onClose]:472) - Closed connection to 127.0.0.1:52348 (1009) Text message size [1061535] exceeds maximum size [1024000]
INFO [2024-11-11 18:20:02,772] ({qtp524223214-13} NotebookServer.java[onOpen]:244) - New connection from 127.0.0.1:39742
INFO [2024-11-11 18:20:04,113] ({qtp524223214-19} NotebookServer.java[onClose]:472) - Closed connection to 127.0.0.1:39742 (1009) Text message size [1061535] exceeds maximum size [1024000]
I tried to increase ZEPPELIN_WEBSOCKET_MAX_TEXT_MESSAGE_SIZE, because the 1024000 in the message is the default value of that environment variable, but the change is never reflected and I still receive the same message. I tried setting it in both zeppelin-env.sh and zeppelin-site.xml, but neither helped.
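On your Dataproc cluster (image 2.2.39-debian12), the "Text message size [...] exceeds maximum size [1024000]" message means Zeppelin's WebSocket server rejected a message larger than its configured limit of 1,024,000 bytes (about 1 MB). When a large query result is serialized into a single message above that limit, the connection is closed with code 1009 (message too big). Judging by your log, the browser then reconnects, hits the same oversized payload or an EofException, and the cycle repeats, which leaves the notebook unresponsive.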
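To confirm that oversized messages are what is closing the connection, and to see how much headroom a new limit needs, the 1009 lines can be summarized straight from the server log. This is a minimal sketch: summarize_ws_rejections is a hypothetical helper name, and it assumes the log lines have the same shape as the ones quoted in the question.

```shell
#!/usr/bin/env bash
# Sketch: summarize WebSocket 1009 closes in a Zeppelin server log.
# summarize_ws_rejections is a hypothetical helper; pass the log file path.
summarize_ws_rejections() {
  local log_file="$1"
  # Extract "<message size> <limit>" pairs from the 1009 close lines.
  sed -n 's/.*(1009) Text message size \[\([0-9]*\)\] exceeds maximum size \[\([0-9]*\)\].*/\1 \2/p' "$log_file" |
    awk '{ n++; if ($1 > max) max = $1; limit = $2 }
         END {
           if (n) printf "rejections=%d largest=%d limit=%d\n", n, max, limit
           else print "rejections=0"
         }'
}
```

Run it against the Zeppelin server log on the master node (the exact file name under /var/log/zeppelin varies). The "largest" figure is the minimum the new limit has to exceed.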
Attempts to modify ZEPPELIN_WEBSOCKET_MAX_TEXT_MESSAGE_SIZE in zeppelin-env.sh or zeppelin-site.xml likely failed because Dataproc's configuration management overrides manual edits. Here are the most effective solutions:
Recommended Solution: Dataproc Cluster Properties
The most reliable approach is setting the zeppelin.websocket.max.text.message.size property using Dataproc cluster properties:
During Cluster Creation:
gcloud dataproc clusters create your-cluster-name \
--properties zeppelin:zeppelin.websocket.max.text.message.size=10485760
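Existing Cluster:
As far as I know, cluster properties are applied only at creation time, and gcloud dataproc clusters update does not accept a --properties flag. For a cluster that is already running, either recreate it with the property set as shown above, or use the manual configuration described under Alternative Solutions.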
Alternative Solutions
For cases where Dataproc properties are not feasible:
Manual Configuration:
SSH into the master node:
gcloud compute ssh your-cluster-name-m
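Edit /etc/zeppelin/conf/zeppelin-site.xml and add the property inside the <configuration> element (if an entry with this name already exists, change its value instead of adding a second one):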
<property>
<name>zeppelin.websocket.max.text.message.size</name>
<value>10485760</value>
</property>
Restart Zeppelin to apply changes:
sudo systemctl restart zeppelin
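If you would rather script the edit than open the file by hand, the sketch below updates the property when it already exists and otherwise inserts it just before the closing </configuration> tag, so the file stays well-formed. set_site_property is a hypothetical helper, not a Dataproc or Zeppelin tool, and it assumes each <name> and <value> element sits on its own line, as in the snippet above.

```shell
#!/usr/bin/env bash
# Sketch: set a property in a Hadoop-style *-site.xml without breaking the XML.
# set_site_property is a hypothetical helper (assumption, not an official tool).
set_site_property() {
  local file="$1" name="$2" value="$3" tmp
  tmp="$(mktemp)"
  awk -v name="$name" -v value="$value" '
    # Literal match on the <name> line of the target property.
    index($0, "<name>" name "</name>") { print; in_prop = 1; found = 1; next }
    # Rewrite the first <value> line that follows it.
    in_prop && /<value>.*<\/value>/ {
      sub(/<value>.*<\/value>/, "<value>" value "</value>")
      in_prop = 0
    }
    # Property absent: emit a new block right before the closing tag.
    /<\/configuration>/ && !found {
      print "<property>"
      print "  <name>" name "</name>"
      print "  <value>" value "</value>"
      print "</property>"
    }
    { print }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps owner and mode
  rm -f "$tmp"
}
```

For example, run as root: set_site_property /etc/zeppelin/conf/zeppelin-site.xml zeppelin.websocket.max.text.message.size 10485760, then restart Zeppelin. Running it twice leaves a single entry.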
Initialization Actions for Automation:
Use an initialization action to automate and persist changes during cluster setup. Example script:
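#!/bin/bash
set -euxo pipefail
# Zeppelin runs only on the master node, so skip workers.
ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
[[ "${ROLE}" == "Master" ]] || exit 0
# Insert the property before the closing tag so the XML stays well-formed
# (appending to the end of the file would place it after </configuration>).
sed -i 's|</configuration>|<property>\n<name>zeppelin.websocket.max.text.message.size</name>\n<value>10485760</value>\n</property>\n</configuration>|' /etc/zeppelin/conf/zeppelin-site.xml
# Initialization actions already run as root, so sudo is not needed.
systemctl restart zeppelin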
Upload this script to Cloud Storage. Reference it during cluster creation:
gcloud dataproc clusters create your-cluster-name \
--initialization-actions gs://your-bucket/zeppelin-config.sh
After applying one of these solutions, rerun the query that triggered the problem and check the Zeppelin logs in /var/log/zeppelin. The fix has taken effect if the "(1009) Text message size [...] exceeds maximum size [1024000]" lines no longer appear; if they still appear with the old limit of 1024000, the running server has not picked up the new value.
The new limit only needs to exceed the largest message Zeppelin sends. In your log that is 1061535 bytes, so 10485760 (10 MB) leaves ample headroom, and with it in place the disconnect loop should stop for result sets of this size.
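Before restarting, it can also help to confirm which value the configuration file actually carries, since a misplaced or duplicated entry is easy to miss. A minimal sketch, assuming the same one-element-per-line layout as the snippet earlier; get_site_property is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the value of one property from a Hadoop-style *-site.xml.
# get_site_property is a hypothetical helper (assumption, not an official tool).
get_site_property() {
  local file="$1" name="$2"
  awk -v name="$name" '
    # Literal match on the <name> line of the requested property.
    index($0, "<name>" name "</name>") { in_prop = 1; next }
    in_prop && match($0, /<value>.*<\/value>/) {
      # Strip the <value> (7 chars) and </value> (8 chars) tags.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$file"
}
```

For example: get_site_property /etc/zeppelin/conf/zeppelin-site.xml zeppelin.websocket.max.text.message.size. Empty output means the property is not set in that file.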