Hi!
I have a Dataproc Single Node Cluster that is used as a Spark History Server.
However I notice in Cloud Logging that this is creating quite a lot of logs.
Looking at the documentation I can see that indeed there are a number of components that will generate logs: https://cloud.google.com/dataproc/docs/guides/logging#cluster_logs_in
But how do I alter/modify the behaviour of these logs? Ideally I want to set the log level to ERROR/OFF if possible.
Of course I could always create an Exclusion Rule at the Sink level, but that feels like I'm fixing the problem at the wrong end.
Any ideas? It isn't obvious from the documentation what is possible.
To manage and reduce the volume of logs generated by your Dataproc Single Node Cluster used as a Spark History Server, combine two approaches: adjust log levels directly within the cluster's components, and use Cloud Logging's features for more granular control over what gets stored. Here's how:
Adjusting Log Levels in Dataproc
Modify Component Log Levels
For components like Spark and Hadoop, which are present on Dataproc clusters, log levels are typically managed through configuration files such as log4j.properties. Depending on the Dataproc image version, the approach varies slightly:
For Dataproc Image Versions < 2.0: Create a custom log4j.properties file and upload it to a Cloud Storage bucket, then apply it to the cluster through an initialization action (a sketch of such a script appears at the end of this answer).
For Dataproc Image Versions >= 2.0: You can set Spark and other component log levels directly with the --properties flag during cluster creation, without managing separate configuration files (see the gcloud sketch after the example file below).
Example log4j.properties:
# Root logger option
log4j.rootLogger=WARN, stdout
# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Reduce Spark Logging
log4j.logger.org.apache.spark=WARN
# Reduce Hadoop/YARN Logging
log4j.logger.org.apache.hadoop.yarn=WARN
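On 2.0+ images the same kind of overrides can be passed as cluster properties at creation time, using file-specific prefixes such as spark-log4j:. The command below is a sketch rather than a verified recipe: the cluster name and region are placeholders, and the property prefixes/keys should be checked against the supported list for your image version.

# Sketch: single-node history-server cluster with reduced Spark/YARN log levels
# (names and prefixes are illustrative; verify against your image version's docs)
gcloud dataproc clusters create spark-history-server \
    --region=us-central1 \
    --single-node \
    --properties='spark-log4j:log4j.logger.org.apache.spark=WARN,hadoop-log4j:log4j.logger.org.apache.hadoop.yarn=WARN'
# If the goal is to stop cluster logs reaching Cloud Logging altogether, the
# dataproc:dataproc.logging.stackdriver.enable=false cluster property is worth
# checking in the current documentation before relying on it.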
Best Practices for Logging
Start with WARN Level: Initially setting log levels to WARN provides a balanced approach, capturing potential issues without overwhelming volume. ERROR might be too restrictive for troubleshooting.
Troubleshooting Tip: Remember, you can temporarily increase log levels to INFO or DEBUG when investigating specific issues. After resolving the problem, revert to WARN or ERROR to reduce log volume.
Balance Logging and Cost: Use Cloud Logging's Logs Explorer for efficient filtering and searching, so you can retain more logs while only reviewing the entries that matter (see the query example below).
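For example, a severity filter keeps review manageable even if you retain everything. The query below uses a placeholder cluster name and can be run in the Logs Explorer or via the CLI.

# Show only WARNING-and-above entries for one cluster (placeholder name)
gcloud logging read \
    'resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="spark-history-server" AND severity>=WARNING' \
    --limit=50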
Utilizing Cloud Logging Features
Log Exclusions
While adjusting source log levels is effective, you can also use Cloud Logging's exclusion rules to filter out logs you deem unnecessary. This method doesn't reduce the generation of logs but can significantly cut down on storage and processing costs.
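That said, if you do want an exclusion as a stop-gap, a sink-level rule can be added from the command line. This is a sketch: the exclusion name and cluster name are placeholders, and the filter should be tightened to the exact log names you want to drop.

# Sketch: drop sub-WARNING Dataproc cluster logs at the _Default sink
gcloud logging sinks update _Default \
    --add-exclusion='name=spark-history-noise,filter=resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="spark-history-server" AND severity<WARNING'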
Advanced Considerations
Custom Logging Solutions: For complex logging needs, consider deploying logging agents like Fluentd within your cluster through initialization actions. This allows for enhanced log management and integration with external systems.
Cost Analysis: Regularly review Cloud Logging's cost estimates and visualizations to understand the impact of your logging configurations and make informed adjustments.
Implementing These Strategies
Prepare your log4j.properties file and upload it to Cloud Storage if you are using Dataproc image versions below 2.0.
Use initialization actions to apply custom logging configurations or deploy logging agents (a sketch follows this list).
Adjust logging levels during cluster creation for newer Dataproc versions using the --properties flag.
Implement Cloud Logging exclusions to manage log storage and processing costs effectively.
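For completeness, here is a minimal sketch of the initialization-action route for pre-2.0 images. Everything in it is an assumption to adapt: the bucket and script names are placeholders, and the config paths (/etc/spark/conf, /etc/hadoop/conf) and restart behaviour should be verified on your image version.

#!/bin/bash
# set-log-levels.sh (hypothetical name) - staged in GCS and run as an init action.
# Copies a custom log4j.properties over the default Spark and Hadoop configs.
set -euo pipefail

STAGED_CONF="gs://your-bucket/conf/log4j.properties"   # placeholder path

gsutil cp "${STAGED_CONF}" /etc/spark/conf/log4j.properties
gsutil cp "${STAGED_CONF}" /etc/hadoop/conf/log4j.properties

# Daemons that are already running may need a restart to pick up the new
# settings; the exact service names vary by image version.

The script would then be referenced at cluster creation, for example:

# Create the cluster with the init action applied (run from your workstation)
gcloud dataproc clusters create spark-history-server \
    --region=us-central1 \
    --single-node \
    --initialization-actions=gs://your-bucket/init/set-log-levels.sh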