
Dataproc high resource usage: analysis and optimization

Hi,

I'm using Dataproc on VMs. We are currently running two clusters in production:

Cluster      Master VM type         Worker VM type         Scale N   Workers N   Master N
general      custom-16-204800-ext   custom-16-204800-ext   8         2           1
accounting   custom-16-204800-ext   custom-16-204800-ext   13        2           1

We currently execute 22 jobs every 3 minutes, divided between two categories:

  • accounting - 10 jobs
  • general - 12 jobs

The issue is that both clusters consume all of their available resources, and those resources are considerable.

These jobs process various files. The operations are not overly complicated: each job simply loads files into a Spark view, transforms the view, performs computations, and that's all. Each job processes between 6 and 100 files (around 100 KB each), in either JSON or Avro format, and a given job reads only one file type.
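
For context, this is roughly the shape of each job (a minimal PySpark sketch; the bucket paths, view name, column names, and query are hypothetical placeholders, not our actual code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("general-job").getOrCreate()

# Hypothetical input location; each job reads either JSON or Avro, never both.
input_paths = "gs://example-bucket/incoming/*.json"

# Load the small batch of files (6-100 files, ~100 KB each) into a DataFrame.
df = spark.read.json(input_paths)
# For the Avro jobs the read would instead be something like:
# df = spark.read.format("avro").load("gs://example-bucket/incoming/*.avro")

# Register a view and run the transformation / computation as SQL.
df.createOrReplaceTempView("raw_events")
result = spark.sql("""
    SELECT account_id, SUM(amount) AS total_amount
    FROM raw_events
    GROUP BY account_id
""")

# Write the result and shut the session down.
result.write.mode("append").parquet("gs://example-bucket/output/")
spark.stop()
```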

Even though all jobs execute simultaneously, the clusters appear to be over-sized for the tasks at hand. However, when we reduce the resources, the jobs start to run slower.

Why do I think the clusters are over-sized?

For example, the accounting cluster has 15 worker nodes when scaled to its peak, each with 200 GB of RAM, which comes to about 3 TB of memory. All jobs run almost simultaneously, so dividing that capacity by the number of jobs gives the approximate memory available to one job: roughly 300 GB. That is 300 GB to process files totalling around 10 MB.
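
Spelling out the arithmetic with the numbers above (a quick sanity check, not measured data):

```python
# Peak accounting cluster capacity vs. the data one job actually reads.
workers_at_peak = 15        # 2 base workers + 13 scaled workers
ram_per_worker_gb = 200     # custom-16-204800-ext => 204800 MB ~= 200 GB RAM
accounting_jobs = 10        # accounting jobs per 3-minute cycle

total_ram_gb = workers_at_peak * ram_per_worker_gb      # 3000 GB ~= 3 TB
ram_per_job_gb = total_ram_gb / accounting_jobs         # ~300 GB per job

max_input_mb = 100 * 100 / 1024                         # 100 files * ~100 KB ~= 10 MB
print(total_ram_gb, ram_per_job_gb, max_input_mb)       # 3000, 300.0, ~9.8
```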

I suspect that the clusters may be experiencing a memory leak or some other issue that hinders execution and retains memory. To investigate this, I've reviewed how the Spark session is terminated and found no issues: session.stop() is called at the end of each job.
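
For reference, the teardown I verified looks like this (a sketch with a placeholder job body; the try/finally wrapper is my assumption about defensive cleanup, not a quote of the production code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accounting-job").getOrCreate()
try:
    # ... read the input files, build views, compute, write results ...
    pass
finally:
    # Stop the session so executors and their memory are released
    # even if the job body raises an exception.
    spark.stop()
```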

What else could be checked?
