Hi,
I'm using Dataproc on VMs. Currently we are running two clusters in production:
| Cluster | Master VM type | Worker VM type | Scale N | Workers N | Master N |
| --- | --- | --- | --- | --- | --- |
| general | custom-16-204800-ext | custom-16-204800-ext | 8 | 2 | 1 |
| accounting | custom-16-204800-ext | custom-16-204800-ext | 13 | 2 | 1 |
We currently execute 22 jobs every 3 minutes, divided between two categories, one per cluster.
The issue is that both clusters consume all of their available resources, which is a considerable amount.
These jobs process various files. The operations are not overly complicated: they load files into a Spark view, transform the view, perform computations, and that's all. Each job processes between 6 and 100 files (around 100 KB each), in either JSON or AVRO format; a given job reads only one file type.
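For context, here is a stripped-down sketch of what one of these jobs looks like (the bucket paths, view name, and columns are made up for illustration; the AVRO variant would read with the spark-avro data source instead of the JSON reader):

```
from pyspark.sql import SparkSession

session = SparkSession.builder.appName("example-file-job").getOrCreate()

try:
    # Load a small batch of input files into a temporary view
    # (AVRO jobs would use session.read.format("avro") instead).
    df = session.read.json("gs://example-bucket/incoming/*.json")
    df.createOrReplaceTempView("input_files")

    # Transform the view and perform the computation.
    result = session.sql("""
        SELECT account_id, SUM(amount) AS total_amount
        FROM input_files
        GROUP BY account_id
    """)
    result.write.mode("overwrite").parquet("gs://example-bucket/output/")
finally:
    # Every job stops its session when it finishes.
    session.stop()
```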
Even though all jobs execute simultaneously, the clusters appear to be over-sized for the tasks at hand. However, when resources are reduced, the jobs begin to run slower.
Why do I think it's over-sized?
For example, the accounting cluster has 15 worker nodes when scaled to its peak, each with 200 GB of RAM, so about 3 TB of memory in total. All jobs execute almost simultaneously, so dividing that capacity by the number of jobs gives the approximate memory available to a single job: around 300 GB. That is 300 GB to process roughly 10 MB of input files.
I suspect that the clusters may be experiencing a memory leak or another issue that hinders execution and retains memory. To investigate, I've reviewed how the Spark session is terminated and found no issues: session.stop() is called at the end of each job.
What else could be checked?
Here are some potential areas for investigation and optimization that could help address the issues you've identified:
1. Memory Allocation Optimization
Action Steps:
Utilize Spark's built-in metrics and monitoring tools to gather detailed memory usage data for each job.
Analyze the memory allocation versus actual usage to identify over-allocation or under-utilization patterns.
Adjust Spark memory settings accordingly, focusing on reducing excess allocation without impacting job performance.
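For example, if the metrics show executors using only a small fraction of what they are given, the per-executor allocation can be lowered. A minimal sketch, with purely illustrative values (derive real numbers from observed peak usage plus some headroom):

```
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .appName("right-sized-job")
    # Illustrative values only, far below what a 200 GB worker can host.
    .config("spark.executor.memory", "4g")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)
```

On Dataproc the same properties can typically be supplied at job submission time instead of in code, which keeps the job logic unchanged.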
2. Tuning Degree of Parallelism
Action Steps:
Conduct experiments with different configurations of executors, cores, and memory settings to find the optimal balance.
Evaluate the impact of changes on job performance and resource utilization, aiming for improved efficiency without additional hardware.
Implement the most effective configuration across your clusters, monitoring closely for any adjustments needed as workloads evolve.
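As a starting point for such experiments, something like the following sketch (hypothetical values chosen for a workload of a few MB per job, not recommendations) keeps the resource footprint small and the shuffle partition count proportional to the tiny input size:

```
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .appName("parallelism-experiment")
    # Fixed, small resource footprint so experiment runs are comparable.
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    # The default of 200 shuffle partitions is excessive for inputs of a few MB.
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate()
)
```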
3. Addressing Data Skew
Action Steps:
Perform a detailed analysis of job execution times and data processing patterns to identify potential data skew issues.
Experiment with techniques such as salting, custom partitioning, or adjusting Spark's partitioner settings to achieve a more balanced data distribution.
Monitor the effects of these adjustments on job performance, particularly for jobs previously identified as outliers in terms of execution time.
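As an illustration of salting, here is a sketch for a hypothetical join that is skewed on account_id (the table names and skew key are assumptions; a similar idea applies to skewed aggregations, and on Spark 3 the adaptive skew-join handling, spark.sql.adaptive.skewJoin.enabled, is worth trying first):

```
from pyspark.sql import SparkSession, functions as F

session = SparkSession.builder.appName("salting-example").getOrCreate()

facts = session.read.parquet("gs://example-bucket/facts/")  # large, skewed side
dims = session.read.parquet("gs://example-bucket/dims/")    # small side

SALT_BUCKETS = 8

# Spread the hot keys of the large side across SALT_BUCKETS buckets.
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("long"))

# Replicate the small side once per salt value so every salted row finds a match.
salted_dims = dims.crossJoin(
    session.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

joined = salted_facts.join(salted_dims, on=["account_id", "salt"])
```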
Additional Considerations
Shuffle Optimization:
Review and optimize Spark's shuffle operations by adjusting configurations such as spark.shuffle.file.buffer and spark.reducer.maxSizeInFlight.
Consider the use of external shuffle services if appropriate for your environment to reduce the load on executors.
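A sketch of what those shuffle settings could look like in code (the values are illustrative; the defaults are 32k and 48m, and whether the external shuffle service is available depends on the cluster setup):

```
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .appName("shuffle-tuning")
    .config("spark.shuffle.file.buffer", "64k")       # default is 32k
    .config("spark.reducer.maxSizeInFlight", "96m")   # default is 48m
    .config("spark.shuffle.service.enabled", "true")  # external shuffle service, if available
    .getOrCreate()
)
```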
Serialization Efficiency:
Confirm that AVRO is being used effectively for your data types, and consider compression options to reduce I/O overhead.
Evaluate the serialization and deserialization times for your jobs, looking for opportunities to switch to more efficient formats or configurations.
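For instance, a sketch combining Kryo serialization with an AVRO compression codec (the spark-avro package must be on the classpath, and the codec choice here is an example, not a recommendation):

```
from pyspark.sql import SparkSession

session = (
    SparkSession.builder
    .appName("serialization-tuning")
    # Kryo is usually faster and more compact than Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Codec applied when writing AVRO output via spark-avro.
    .config("spark.sql.avro.compression.codec", "snappy")
    .getOrCreate()
)

# Reading AVRO input, as some of the jobs described above do.
df = session.read.format("avro").load("gs://example-bucket/incoming/*.avro")
```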
Enhanced Monitoring:
Leverage both the Spark UI and Google Cloud Dataproc's monitoring tools to gain insights into cluster and job performance.
Set up alerts for key metrics to proactively manage resource utilization and job execution times.
Thank you for sharing this plan of action.
Is it possible to get more details somewhere about this step: "Analyze the memory allocation versus actual usage to identify over-allocation or under-utilization patterns"?
Thank you.
Analyzing memory allocation versus actual usage in Spark involves a few steps and tools that can help you identify whether your resources are being over-allocated or under-utilized. This process is crucial for optimizing your Spark jobs for better performance and efficiency. Here's a more detailed breakdown of how you can approach this:
1. Start by Gathering Data
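One way to gather the raw numbers, as a sketch: query the Spark UI's REST API for per-executor memory figures. This assumes the driver UI is reachable (for example on localhost:4040 through an SSH tunnel to the Dataproc master); the same endpoints are typically served by the Spark History Server on port 18080 for completed applications. Note that memoryUsed/maxMemory cover Spark's storage memory rather than the whole JVM heap, but they are enough to spot gross over-allocation.

```
import requests

BASE = "http://localhost:4040/api/v1"  # adjust host/port for your environment

# The live driver UI lists only the currently running application.
apps = requests.get(f"{BASE}/applications").json()
app_id = apps[0]["id"]

# Compare what each executor actually holds against what it was given.
for ex in requests.get(f"{BASE}/applications/{app_id}/executors").json():
    used_mb = ex["memoryUsed"] / 1024 / 1024
    max_mb = ex["maxMemory"] / 1024 / 1024
    print(f"{ex['id']}: {used_mb:.0f} MB used of {max_mb:.0f} MB available")
```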
2. Connect with Monitoring Tools
3. Experiment and Adjust
Based on what you learn, adjust your Spark settings (for example, spark.executor.memory). Make changes gradually and carefully watch how they impact your jobs.
4. Document and Repeat
Key Points
Hi @ms4446 ,
How can I get these metrics? Can we get them with GCP Metrics Explorer?