Hi!
I have a Dataproc Single Node Cluster that is used as a Spark History Server.
However I notice in Cloud Logging that this is creating quite a lot of logs.
Looking at the documentation I can see that indeed there are a number of components that will generate logs: https://cloud.google.com/dataproc/docs/guides/logging#cluster_logs_in
But how do I alter/modify the behaviour of these logs? Ideally I want to set the log level to ERROR/OFF if possible.
Of course I could always create an Exclusion Rule at the Sink level, but that feels like I'm fixing the problem at the wrong end.
Any ideas? It isn't obvious from the documentation what is possible.
To manage and reduce the volume of logs generated by your Dataproc Single Node Cluster used as a Spark History Server, combine two approaches: adjust log levels directly within the cluster's components, and use Cloud Logging's features for more granular control over what gets stored. Here's how:
Adjusting Log Levels in Dataproc
Modify Component Log Levels
For components like Spark and Hadoop, which are present on Dataproc clusters, log levels are typically managed through configuration files such as log4j.properties. Depending on the Dataproc image version, the approach varies slightly:
For Dataproc Image Versions < 2.0: Create a custom log4j.properties file and upload it to a Cloud Storage bucket, then apply it to the cluster through an initialization action (a sketch of such a script appears at the end of this answer).
For Dataproc Image Versions >= 2.0: You can set Spark and other component log levels directly with the --properties flag during cluster creation, without managing separate configuration files (see the gcloud sketch after the example file below).
Example log4j.properties:
# Root logger option
log4j.rootLogger=WARN, stdout
# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Reduce Spark Logging
log4j.logger.org.apache.spark=WARN
# Reduce Hadoop/YARN Logging
log4j.logger.org.apache.hadoop.yarn=WARN
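On 2.0+ images the same kind of overrides can be passed as cluster properties at creation time, using file-specific prefixes such as spark-log4j:. The command below is a sketch rather than a verified recipe: the cluster name and region are placeholders, and the property prefixes/keys should be checked against the supported list for your image version.

# Sketch: single-node history-server cluster with reduced Spark/YARN log levels
# (names and prefixes are illustrative; verify against your image version's docs)
gcloud dataproc clusters create spark-history-server \
    --region=us-central1 \
    --single-node \
    --properties='spark-log4j:log4j.logger.org.apache.spark=WARN,hadoop-log4j:log4j.logger.org.apache.hadoop.yarn=WARN'
# If the goal is to stop cluster logs reaching Cloud Logging altogether, the
# dataproc:dataproc.logging.stackdriver.enable=false cluster property is worth
# checking in the current documentation before relying on it.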
Best Practices for Logging
Start with WARN Level: Initially setting log levels to WARN provides a balanced approach, capturing potential issues without overwhelming volume. ERROR might be too restrictive for troubleshooting.
Troubleshooting Tip: Remember, you can temporarily increase log levels to INFO or DEBUG when investigating specific issues. After resolving the problem, revert to WARN or ERROR to reduce log volume.
Balance Logging and Cost: Use Cloud Logging's Logs Explorer for efficient filtering and searching, so you can retain more logs while only reviewing the entries that matter (see the query example below).
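For example, a severity filter keeps review manageable even if you retain everything. The query below uses a placeholder cluster name and can be run in the Logs Explorer or via the CLI.

# Show only WARNING-and-above entries for one cluster (placeholder name)
gcloud logging read \
    'resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="spark-history-server" AND severity>=WARNING' \
    --limit=50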
Utilizing Cloud Logging Features
Log Exclusions
While adjusting source log levels is effective, you can also use Cloud Logging's exclusion rules to filter out logs you deem unnecessary. This method doesn't reduce the generation of logs but can significantly cut down on storage and processing costs.
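That said, if you do want an exclusion as a stop-gap, a sink-level rule can be added from the command line. This is a sketch: the exclusion name and cluster name are placeholders, and the filter should be tightened to the exact log names you want to drop.

# Sketch: drop sub-WARNING Dataproc cluster logs at the _Default sink
gcloud logging sinks update _Default \
    --add-exclusion='name=spark-history-noise,filter=resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="spark-history-server" AND severity<WARNING'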
Advanced Considerations
Custom Logging Solutions: For complex logging needs, consider deploying logging agents like Fluentd within your cluster through initialization actions. This allows for enhanced log management and integration with external systems.
Cost Analysis: Regularly review Cloud Logging's cost estimates and visualizations to understand the impact of your logging configurations and make informed adjustments.
Implementing These Strategies
Prepare your log4j.properties file and upload it to Cloud Storage if you are using Dataproc image versions below 2.0.
Use initialization actions to apply custom logging configurations or deploy logging agents (a sketch follows this list).
Adjust logging levels during cluster creation for newer Dataproc versions using the --properties flag.
Implement Cloud Logging exclusions to manage log storage and processing costs effectively.
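For completeness, here is a minimal sketch of the initialization-action route for pre-2.0 images. Everything in it is an assumption to adapt: the bucket and script names are placeholders, and the config paths (/etc/spark/conf, /etc/hadoop/conf) and restart behaviour should be verified on your image version.

#!/bin/bash
# set-log-levels.sh (hypothetical name) - staged in GCS and run as an init action.
# Copies a custom log4j.properties over the default Spark and Hadoop configs.
set -euo pipefail

STAGED_CONF="gs://your-bucket/conf/log4j.properties"   # placeholder path

gsutil cp "${STAGED_CONF}" /etc/spark/conf/log4j.properties
gsutil cp "${STAGED_CONF}" /etc/hadoop/conf/log4j.properties

# Daemons that are already running may need a restart to pick up the new
# settings; the exact service names vary by image version.

The script would then be referenced at cluster creation, for example:

# Create the cluster with the init action applied (run from your workstation)
gcloud dataproc clusters create spark-history-server \
    --region=us-central1 \
    --single-node \
    --initialization-actions=gs://your-bucket/init/set-log-levels.sh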