
Dataproc high resource usage: analysis and optimization

Hi,

I'm using Dataproc on VMs. We are currently running two clusters in production:

Cluster      Master VM type         Worker VM type         Scale N   Workers N   Master N
general      custom-16-204800-ext   custom-16-204800-ext   8         2           1
accounting   custom-16-204800-ext   custom-16-204800-ext   13        2           1

We currently execute 22 jobs every 3 minutes, divided between two categories:

  • accounting - 10 jobs
  • general - 12 jobs

The issue is that both clusters consume all of their available resources, and those resources are considerable.

These jobs process various files. The operations are not overly complicated: each job simply loads files into a Spark view, transforms the view, performs computations, and that's all. Each job processes between 6 and 100 files (around 100 KB each), in either JSON or Avro format, and a given job reads only one file type.
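
For context, this is roughly the shape of each job (a minimal PySpark sketch; the bucket paths, view name, column names, and query are hypothetical placeholders, not our actual code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("general-job").getOrCreate()

# Hypothetical input location; each job reads either JSON or Avro, never both.
input_paths = "gs://example-bucket/incoming/*.json"

# Load the small batch of files (6-100 files, ~100 KB each) into a DataFrame.
df = spark.read.json(input_paths)
# For the Avro jobs the read would instead be something like:
# df = spark.read.format("avro").load("gs://example-bucket/incoming/*.avro")

# Register a view and run the transformation / computation as SQL.
df.createOrReplaceTempView("raw_events")
result = spark.sql("""
    SELECT account_id, SUM(amount) AS total_amount
    FROM raw_events
    GROUP BY account_id
""")

# Write the result and shut the session down.
result.write.mode("append").parquet("gs://example-bucket/output/")
spark.stop()
```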

Even though all jobs execute simultaneously, the clusters appear to be over-sized for the tasks at hand. However, when we reduce the resources, the jobs start to run slower.

Why do I think the clusters are over-sized?

For example, the accounting cluster has 15 worker nodes when scaled to its peak, each with 200 GB of RAM, which comes to about 3 TB of memory. All jobs run almost simultaneously, so dividing that capacity by the number of jobs gives the approximate memory available to one job: roughly 300 GB. That is 300 GB to process files totalling around 10 MB.
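
Spelling out the arithmetic with the numbers above (a quick sanity check, not measured data):

```python
# Peak accounting cluster capacity vs. the data one job actually reads.
workers_at_peak = 15        # 2 base workers + 13 scaled workers
ram_per_worker_gb = 200     # custom-16-204800-ext => 204800 MB ~= 200 GB RAM
accounting_jobs = 10        # accounting jobs per 3-minute cycle

total_ram_gb = workers_at_peak * ram_per_worker_gb      # 3000 GB ~= 3 TB
ram_per_job_gb = total_ram_gb / accounting_jobs         # ~300 GB per job

max_input_mb = 100 * 100 / 1024                         # 100 files * ~100 KB ~= 10 MB
print(total_ram_gb, ram_per_job_gb, max_input_mb)       # 3000, 300.0, ~9.8
```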

I suspect that the clusters may be experiencing a memory leak or some other issue that hinders execution and retains memory. To investigate this, I've reviewed how the Spark session is terminated and found no issues: session.stop() is called at the end of each job.
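
For reference, the teardown I verified looks like this (a sketch with a placeholder job body; the try/finally wrapper is my assumption about defensive cleanup, not a quote of the production code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accounting-job").getOrCreate()
try:
    # ... read the input files, build views, compute, write results ...
    pass
finally:
    # Stop the session so executors and their memory are released
    # even if the job body raises an exception.
    spark.stop()
```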

What else could be checked?
