Hi,
I'm using Dataproc on VMs. Currently we are running two clusters in production:
| Cluster | Master VM type | Worker VM type | Scale N | Workers N | Master N |
| --- | --- | --- | --- | --- | --- |
| general | custom-16-204800-ext | custom-16-204800-ext | 8 | 2 | 1 |
| accounting | custom-16-204800-ext | custom-16-204800-ext | 13 | 2 | 1 |
We currently execute 22 jobs every 3 minutes, divided between two categories, one per cluster.
The issue is that both clusters consume all of their available resources, which is a considerable amount.
These jobs process various files. The operations are not overly complicated: they load files into a Spark view, transform the view, perform computations, and that's all. Each job processes between 6 and 100 files (around 100 KB each), in either JSON or AVRO format; a given job reads only one file type.
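For context, here is a stripped-down sketch of what one of these jobs looks like (the bucket paths, view name, and columns are made up for illustration; the AVRO variant would read with the spark-avro data source instead of the JSON reader):

```
from pyspark.sql import SparkSession

session = SparkSession.builder.appName("example-file-job").getOrCreate()

try:
    # Load a small batch of input files into a temporary view
    # (AVRO jobs would use session.read.format("avro") instead).
    df = session.read.json("gs://example-bucket/incoming/*.json")
    df.createOrReplaceTempView("input_files")

    # Transform the view and perform the computation.
    result = session.sql("""
        SELECT account_id, SUM(amount) AS total_amount
        FROM input_files
        GROUP BY account_id
    """)
    result.write.mode("overwrite").parquet("gs://example-bucket/output/")
finally:
    # Every job stops its session when it finishes.
    session.stop()
```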
Even though all jobs execute simultaneously, the clusters appear to be over-sized for the tasks at hand. However, when resources are reduced, the jobs begin to run slower.
Why do I think it's over-sized?
For example, the accounting cluster has 15 worker nodes when scaled to its peak, each with 200 GB of RAM, so about 3 TB of memory in total. All jobs execute almost simultaneously, so dividing that capacity by the number of jobs gives the approximate memory available to a single job: around 300 GB. That is 300 GB to process roughly 10 MB of input files.
I suspect that the clusters may be experiencing a memory leak or another issue that hinders execution and retains memory. To investigate, I've reviewed how the Spark session is terminated and found no issues: session.stop() is called at the end of each job.
What else could be checked?
Here are some potential areas for investigation and optimization that could help address the issues you've identified:
1. Memory Allocation Optimization
Action Steps:
Utilize Spark's built-in metrics and monitoring tools to gather detailed memory usage data for each job.
Analyze the memory allocation versus actual usage to identify over-allocation or under-utilization patterns.
Adjust Spark memory settings accordingly, focusing on reducing excess allocation without impacting job performance.
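For example, if the metrics show executors using only a small fraction of what they are given, the per-executor allocation can be lowered. A minimal sketch, with purely illustrative values (derive real numbers from observed peak usage plus some headroom):

```
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .appName("right-sized-job")
    # Illustrative values only, far below what a 200 GB worker can host.
    .config("spark.executor.memory", "4g")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)
```

On Dataproc the same properties can typically be supplied at job submission time instead of in code, which keeps the job logic unchanged.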
2. Tuning Degree of Parallelism
Action Steps:
Conduct experiments with different configurations of executors, cores, and memory settings to find the optimal balance.
Evaluate the impact of changes on job performance and resource utilization, aiming for improved efficiency without additional hardware.
Implement the most effective configuration across your clusters, monitoring closely for any adjustments needed as workloads evolve.
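As a starting point for such experiments, something like the following sketch (hypothetical values chosen for a workload of a few MB per job, not recommendations) keeps the resource footprint small and the shuffle partition count proportional to the tiny input size:

```
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .appName("parallelism-experiment")
    # Fixed, small resource footprint so experiment runs are comparable.
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    # The default of 200 shuffle partitions is excessive for inputs of a few MB.
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate()
)
```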
3. Addressing Data Skew
Action Steps:
Perform a detailed analysis of job execution times and data processing patterns to identify potential data skew issues.
Experiment with techniques such as salting, custom partitioning, or adjusting Spark's partitioner settings to achieve a more balanced data distribution.
Monitor the effects of these adjustments on job performance, particularly for jobs previously identified as outliers in terms of execution time.
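As an illustration of salting, here is a sketch for a hypothetical join that is skewed on account_id (the table names and skew key are assumptions; a similar idea applies to skewed aggregations, and on Spark 3 the adaptive skew-join handling, spark.sql.adaptive.skewJoin.enabled, is worth trying first):

```
from pyspark.sql import SparkSession, functions as F

session = SparkSession.builder.appName("salting-example").getOrCreate()

facts = session.read.parquet("gs://example-bucket/facts/")  # large, skewed side
dims = session.read.parquet("gs://example-bucket/dims/")    # small side

SALT_BUCKETS = 8

# Spread the hot keys of the large side across SALT_BUCKETS buckets.
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("long"))

# Replicate the small side once per salt value so every salted row finds a match.
salted_dims = dims.crossJoin(
    session.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

joined = salted_facts.join(salted_dims, on=["account_id", "salt"])
```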
Additional Considerations
Shuffle Optimization:
Review and optimize Spark's shuffle operations by adjusting configurations such as spark.shuffle.file.buffer and spark.reducer.maxSizeInFlight.
Consider the use of external shuffle services if appropriate for your environment to reduce the load on executors.
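A sketch of what those shuffle settings could look like in code (the values are illustrative; the defaults are 32k and 48m, and whether the external shuffle service is available depends on the cluster setup):

```
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .appName("shuffle-tuning")
    .config("spark.shuffle.file.buffer", "64k")       # default is 32k
    .config("spark.reducer.maxSizeInFlight", "96m")   # default is 48m
    .config("spark.shuffle.service.enabled", "true")  # external shuffle service, if available
    .getOrCreate()
)
```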
Serialization Efficiency:
Confirm that AVRO is being used effectively for your data types, and consider compression options to reduce I/O overhead.
Evaluate the serialization and deserialization times for your jobs, looking for opportunities to switch to more efficient formats or configurations.
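For instance, a sketch combining Kryo serialization with an AVRO compression codec (the spark-avro package must be on the classpath, and the codec choice here is an example, not a recommendation):

```
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .appName("serialization-tuning")
    # Kryo is usually faster and more compact than Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Codec applied when writing AVRO output via spark-avro.
    .config("spark.sql.avro.compression.codec", "snappy")
    .getOrCreate()
)

# Reading AVRO input, as some of the jobs described above do.
df = session.read.format("avro").load("gs://example-bucket/incoming/*.avro")
```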
Enhanced Monitoring:
Leverage both the Spark UI and Google Cloud Dataproc's monitoring tools to gain insights into cluster and job performance.
Set up alerts for key metrics to proactively manage resource utilization and job execution times.
Thank you for sharing this plan of action.
Is it possible to get more details somewhere about this step: "Analyze the memory allocation versus actual usage to identify over-allocation or under-utilization patterns"?
Thank you.
Analyzing memory allocation versus actual usage in Spark involves a few steps and tools that can help you identify whether your resources are being over-allocated or under-utilized. This process is crucial for optimizing your Spark jobs for better performance and efficiency. Here's a more detailed breakdown of how you can approach this:
1. Start by Gathering Data
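One way to gather the raw numbers, as a sketch: query the Spark UI's REST API for per-executor memory figures. This assumes the driver UI is reachable (for example on localhost:4040 through an SSH tunnel to the Dataproc master); the same endpoints are typically served by the Spark History Server on port 18080 for completed applications. Note that memoryUsed/maxMemory cover Spark's storage memory rather than the whole JVM heap, but they are enough to spot gross over-allocation.

```
import requests

BASE = "http://localhost:4040/api/v1"  # adjust host/port for your environment

# The live driver UI lists only the currently running application.
apps = requests.get(f"{BASE}/applications").json()
app_id = apps[0]["id"]

# Compare what each executor actually holds against what it was given.
for ex in requests.get(f"{BASE}/applications/{app_id}/executors").json():
    used_mb = ex["memoryUsed"] / 1024 / 1024
    max_mb = ex["maxMemory"] / 1024 / 1024
    print(f"{ex['id']}: {used_mb:.0f} MB used of {max_mb:.0f} MB available")
```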
2. Connect with Monitoring Tools
3. Experiment and Adjust
Based on what you learn, adjust your Spark settings (for example, spark.executor.memory). Make changes gradually and carefully watch how they impact your jobs.
4. Document and Repeat
Key Points
Hi @ms4446 ,
How can I get these metrics? Can we get them with GCP Metrics Explorer?