I created an autoscaling policy and attached it to a cluster with 2 primary worker nodes (2 cores, 8 GB each). I ran a PySpark job with the number of executors set to 50, but it did not trigger autoscaling.
I went through the documentation; most of it explains the theory, but there are no real samples. I even tried a core-based autoscaling policy.
Could someone provide a real-life configuration as a demo?
Here is the policy I got:
{ "id": "xyz", "name": "projects/unravel-dataproc/regions/us-central1/autoscalingPolicies/xyz", "basicAlgorithm": { "yarnConfig": { "scaleUpFactor": 1, "gracefulDecommissionTimeout": "3600s" }, "cooldownPeriod": "120s" }, "workerConfig": { "minInstances": 2, "maxInstances": 2, "weight": 1 }, "secondaryWorkerConfig": { "maxInstances": 5, "weight": 1 } }
Hi @waynez,
Welcome to Google Cloud Community!
Upon checking the policy you provided, it seems like your minInstances and maxInstances are both set to 2. This means your primary worker group is fixed at 2 nodes, so no primary-worker scaling will occur, even if you request 50 executors.
Here’s what you can do: increase maxInstances to a value greater than 2 so your autoscaling policy can take effect:
{
  "id": "xyz",
  "name": "projects/unravel-dataproc/regions/us-central1/autoscalingPolicies/xyz",
  "basicAlgorithm": {
    "yarnConfig": {
      "scaleUpFactor": 1,
      "gracefulDecommissionTimeout": "3600s"
    },
    "cooldownPeriod": "120s"
  },
  "workerConfig": {
    "minInstances": 2,   // Start with 2 primary worker nodes
    "maxInstances": 10,  // Allow scaling up to 10 primary worker nodes
    "weight": 1
  },
  "secondaryWorkerConfig": {
    "maxInstances": 5,
    "weight": 1
  }
}
(The // comments above are annotations only; remove them before importing, since JSON does not allow comments.)
Note: scaleUpFactor can be configured between 0.0 and 1.0.
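If it helps, once you have edited the policy file you can re-import it with gcloud, and the cluster the policy is attached to should pick up the change. A rough sketch (the policy ID, region, and file name are placeholders, and note that the import command expects a YAML source file):

# Hypothetical IDs/paths; creates or updates the named policy
gcloud dataproc autoscaling-policies import xyz \
    --region=us-central1 \
    --source=policy.yaml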
I hope the above information is helpful.
Thanks for the answer; I will give it a try. The documentation is not clear: I thought the worker configuration did not include the secondary workers. My intention was to keep the primary worker nodes at 2 and scale only with the secondary workers.
@jangemmar Also, the scale-up/down factor ranges from 0.0 to 1.0, correct?
From the documentation:

scaleUpFactor: Fraction of average pending memory in the last cooldown period for which to add workers. A scale-up factor of 1.0 will result in scaling up so that there is no pending memory remaining after the update (more aggressive scaling). A scale-up factor closer to 0 will result in a smaller magnitude of scaling up (less aggressive scaling).

scaleDownFactor: Required. Fraction of average available memory in the last cooldown period for which to remove workers. A scale-down factor of 1 will result in scaling down so that there is no available memory remaining after the update (more aggressive scaling). A scale-down factor of 0 disables removing workers, which can be beneficial for autoscaling a single job.
@waynez yes, that's correct. scaleUpFactor controls how aggressively the autoscaler scales up a cluster. You can specify a number between 0.0 and 1.0 to set the fractional value of YARN pending resource that causes node addition.
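As a rough worked example (the numbers are made up): if the average pending YARN memory over the last cooldown period is 10 GB, a scaleUpFactor of 1.0 adds enough workers to cover the full 10 GB, while a scaleUpFactor of 0.5 adds only enough to cover 5 GB.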
@jangemmar I don't want to scale the primary worker nodes, only the secondary worker nodes.
So the following setting is not what I wanted.
"workerConfig": { "minInstances": 2, "maxInstances": 10, "weight": 1 },
What should I do for a policy with only 2 primary workers and secondary nodes scaling from 0 to 5?
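For what it's worth, this is roughly what I have in mind; I'm assuming secondaryWorkerConfig also accepts minInstances, but I'm not sure that's valid:

"workerConfig": {
  "minInstances": 2,
  "maxInstances": 2,
  "weight": 1
},
"secondaryWorkerConfig": {
  "minInstances": 0,
  "maxInstances": 5,
  "weight": 1
}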
@jangemmar Also, I got an error when setting the factor to more than 1 in the GCP console:
Scale up factor must be less than or equal to 1
Scale down factor must be less than or equal to 1
@waynez apologies for the confusion. Correct! scaleUpFactor can be set between 0.0 and 1.0. The secondary workers are created through a managed instance group, which means all the operations performed by Compute Engine as part of the managed instance group are carried out using the Google APIs service account for your project. If there is a permission issue with your service account, the Dataproc logs will not show any errors corresponding to the creation failure of the secondary workers; the instances will also show up in the VM instances tab of the GCP console for that cluster without a green checkmark, which indicates that the VM has not been created yet. Also, try adding the scaleDownFactor to your autoscaling policy to test:
{
  "id": "xyz",
  "name": "projects/unravel-dataproc/regions/us-central1/autoscalingPolicies/xyz",
  "basicAlgorithm": {
    "yarnConfig": {
      "scaleUpFactor": 1,
      "scaleDownFactor": 1,
      "gracefulDecommissionTimeout": "3600s"
    },
    "cooldownPeriod": "120s"
  },
  "workerConfig": {
    "minInstances": 2,
    "maxInstances": 2,
    "weight": 1
  },
  "secondaryWorkerConfig": {
    "maxInstances": 10,
    "weight": 1
  }
}
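To verify the behavior end to end, you could re-run the PySpark job and watch whether secondary workers come up. A rough sketch (the job, cluster, and region names are placeholders):

# Placeholder names; requests 50 executors to generate pending YARN memory
gcloud dataproc jobs submit pyspark job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --properties=spark.executor.instances=50

# Then check the current secondary worker count
gcloud dataproc clusters describe my-cluster --region=us-central1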
Also, you can refer here for the autoscaling configuration recommendations.