Solved: Re: Selected software image version '2.1.24-s8s-sp...

jeyob · 11-27-2023 09:09 AM

Hi,

I have a dataproc serverless batch job using the 2.1 Runtime version.
Looking to use the new autoscaling version on my job (i.e.

"spark.dataproc.scaling.version": "2") but when submitting the job it fails instantly with:

"Selected software image version '2.1.24-s8s-spark' does not support control plane decommissioning"

I can also mention that I am using a custom image (according to instructions on the docs)

Any idea what might be causing this?

ms4446

I'm glad to hear that the new patch version 2.1.27 resolved the control plane error for your Dataproc Serverless setup.

Regarding the verification of the autoscaling version:

Observe Scaling Behavior: The most practical way to infer the autoscaling version is by observing how your jobs scale. Autoscaling version 2 should show more efficient and responsive scaling, especially in reducing resources when they're not needed.
Check Documentation and Release Notes: Keep an eye on Google Cloud's documentation and release notes for any updates or tips on verifying autoscaling versions.
Conduct Experiments: If possible, test with controlled workloads to see how the autoscaler responds under different conditions.

Staying updated with Google Cloud's releases and monitoring your system's performance will help you make the most of Dataproc Serverless.

View solution in original post

ms4446

The issue you're encountering with Dataproc Serverless appears to be related to the compatibility of your custom image ("2.1.24-s8s-spark") with the autoscaling version 2, specifically its support for control plane decommissioning. While autoscaling version 2 is available for Spark runtime version 2.1, your custom image may not include the necessary components for this feature.

Here are some steps you can take to resolve this issue:

Check for Updated Custom Images: Before building a new custom image, verify if there's an updated version of your current custom image that supports control plane decommissioning. This could be a simpler solution than creating a new image from scratch.
Consider Using an Official Dataproc Serverless Image: If an updated custom image is not available, you might want to use an official Dataproc Serverless image for Spark 2.1, which is likely to have built-in support for the latest features, including autoscaling version 2.
Rebuild Your Custom Image: If you prefer to continue using a custom image, you can rebuild it with the necessary components for control plane decommissioning. Detailed guidance on building custom containers for Dataproc Serverless can be found in the official documentation.
Use Autoscaling Version 1 as a Temporary Measure: As an interim solution, you can switch to autoscaling version 1 by setting "spark.dataproc.scaling.version" to "1". This version does not require control plane decommissioning and might be compatible with your current custom image.

Additional Resources:

For more information on autoscaling in Dataproc Serverless for Spark, visit the autoscaling documentation.

jeyob

Thanks for the quick reply @ms4446
I can also add that I did a simple change to degrade the spark runtime version to 2.0 (

specifically 2.0.45)

and that seem to have worked fine with "spark.dataproc.scaling.version= 2"!

So I am not sure its necessarily the compatibility of the custom image with "spark.dataproc.scaling.version=2"?

ms4446

Hi @jeyob ,

Yes, you are correct in identifying the issue as related to the compatibility of the Spark runtime version with the new autoscaling feature, specifically the requirement for control plane decommissioning. However, it appears that the issue might be more specifically tied to the particular Spark runtime version you were using (2.1.24) rather than the custom image as a whole.

Since downgrading to Spark runtime version 2.0.45 resolved the issue with "spark.dataproc.scaling.version=2", it suggests that this earlier version is compatible with the autoscaling requirements, or that the specific 2.1.24 version you used has some limitations or missing components related to autoscaling.

To address this, you have a couple of options:

Update to a Different 2.1.x Version: Instead of using 2.1.24, you might try a different patch version within the 2.1.x series that supports control plane decommissioning. This could maintain the benefits of the 2.1 runtime while ensuring compatibility with autoscaling version 2.
Create a New Custom Image: If you prefer to stick with the 2.1.24 version, consider creating a new custom image based on this version, ensuring that it includes all necessary components for control plane decommissioning. You can refer to the Dataproc Serverless documentation for custom containers for guidance.
Update Batch Job Submission: Once you have the appropriate custom image, update your Dataproc batch job submission command to use this new image. For example:

--properties=dataproc:imageVersion=<your-custom-image-version>

Replace <your-custom-image-version> with the version number of your new custom image.

By following these steps, you should be able to submit your Dataproc Serverless batch job without encountering the previous error message.

I hope this provides a clearer path forward!

jeyob

@ms4446 thanks for the explanation.

Some follow up questions:

If wanted to go with another patch version, how do I achieve this? When trying to do that e.g. 2.1.23 (instead of 2.1.24) I get this error

ERROR: (gcloud.dataproc.batches.submit.pyspark) INVALID_ARGUMENT: The subminor version specification is not supported for the 2.1 runtime version
Not sure of your 2) point on creating a new custom image. Based on my understanding of the Custom container for Dataproc Serverless the Spark Runtime is mounted into the custom image. Don't think I understand what it would mean to create a custom image based on 2.1.24 image since that appears to be something I can't control.

Thanks again, not too familiar with control plane concept in general and what changes I would need to facilitate this.

Also is there a good way to verify/validate that autoscaling version 2 is enabled?

ms4446

Hi @jeyob ,

1. Using Another Patch Version
The error message you're encountering when trying to specify a subminor version (like 2.1.23 instead of 2.1.24) suggests that Google Cloud Dataproc Serverless does not support specifying such detailed version granularity for the runtime. This limitation means you can't directly choose a specific patch version if it's not explicitly offered by Google Cloud.

Possible Solution: Since you can't specify subminor versions, you're limited to the versions explicitly offered by Google Cloud. You can check the available versions in the Google Cloud Console or using the gcloud command-line tool to see if there are alternative versions you can use.

2. Creating a New Custom Image
Regarding creating a new custom image, you're correct that the Spark runtime is mounted into the custom image in Dataproc Serverless. When I suggested creating a new custom image, it was under the assumption that there might be additional configurations or dependencies that you could include in your custom container to ensure compatibility with the autoscaling feature. However, if the runtime itself (which is mounted) does not support certain features, then customizing the container might not resolve the issue.

Clarification: If the limitation lies within the Spark runtime version itself, then customizing the container won't help. In this case, you're dependent on the versions and features supported by Google Cloud's provided runtimes.

3. Understanding Control Plane Decommissioning
The concept of "control plane decommissioning" in the context of Dataproc Serverless autoscaling is a bit complex. It generally refers to the ability of the control plane (the management layer of the cluster) to dynamically scale down resources when they are no longer needed, in a way that doesn't disrupt running jobs. This is a key feature for efficient autoscaling.

4. Verifying Autoscaling Version 2 is Enabled
To verify that autoscaling version 2 is enabled, you can:

Check Job Configuration: When you submit a job, ensure that the property spark.dataproc.scaling.version is set to 2. This should be part of your job submission command or configuration file.

Monitor the Job: Once the job is running, you can monitor its behavior in the Google Cloud Console. Autoscaling version 2 should exhibit more dynamic scaling behavior compared to version 1, particularly in how it scales down resources.

Logs and Metrics: Check the logs and metrics of your Dataproc Serverless job. There might be specific logs or metrics that indicate which version of autoscaling is being used.

Note: The exact steps to verify this might vary based on the tools and interfaces you are using (Google Cloud Console, gcloud CLI, etc.).

Given these points, your best course of action might be to work with the versions available to you and monitor Google Cloud's updates for any changes in supported runtime versions or features. If the issue persists or if you need more specific guidance, contacting Google Cloud support could provide more tailored assistance.

jeyob

Thanks again for the quick reply and suggestions!

it seems a new patch version 2.1.27 was just recently released, this gets automatically picked for the 2.1.

And the control plane error I was seeing before has disappeared 🙂

I tried checking for the version as you suggested, but unfortunately the logs don't seem to reveal what version of the autoscaler that is being used

Checked cloud logging logs (logname: autoscaler) but that doesn't include any version details.
Also tried to check the spark ui for any details regarding this. But it doesn't seem to be a value set under Environment -> Spark configuration

But I'll keep on eye on how scaling may be differ between version 1 and 2 for the same workload

ms4446

I'm glad to hear that the new patch version 2.1.27 resolved the control plane error for your Dataproc Serverless setup.

Regarding the verification of the autoscaling version:

Observe Scaling Behavior: The most practical way to infer the autoscaling version is by observing how your jobs scale. Autoscaling version 2 should show more efficient and responsive scaling, especially in reducing resources when they're not needed.
Check Documentation and Release Notes: Keep an eye on Google Cloud's documentation and release notes for any updates or tips on verifying autoscaling versions.
Conduct Experiments: If possible, test with controlled workloads to see how the autoscaler responds under different conditions.

Staying updated with Google Cloud's releases and monitoring your system's performance will help you make the most of Dataproc Serverless.