deadline exceeded when scaling-in

bluage_nakayama · 10-11-2023 09:21 AM

Hi,

We use Cloud Spanner and an auto-scaler for our system. However, when the system scales in, we encounter "deadline exceeded" errors. For instance, whenever the computing resources are reduced by 2K, these errors consistently occur.

Is there a way to fix this issue? Why doesn't the system scale in gracefully?

Thanks!

ms4446

When you scale in a Cloud Spanner instance, the system must redistribute data and perform other maintenance tasks. This can take some time, and if the system is under heavy load, it may not be able to complete these tasks before the deadline expires.

Another reason why Cloud Spanner might not scale in gracefully is that the auto-scaler may not be configured correctly. The auto-scaler has a number of parameters that control how it scales the system, such as the minimum and maximum number of nodes, the cooldown period, and the scaling method. If these parameters are not set correctly, the system may scale in too aggressively, which can lead to errors.

How to fix the issue:

There are a few things you can do to fix the issue of Cloud Spanner not scaling in gracefully:

Increase the scaling timeout: The auto-scaler has a timeout parameter that specifies how long it waits for a scaling operation to complete before returning an error. You can increase this timeout to give the system more time to scale in.
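Reduce the scale-in ratio: Limit how much capacity is removed in a single scale-in event so that the system shrinks more slowly. If you use the open-source Autoscaler tool for Cloud Spanner, I believe the relevant settings are stepSize for the stepwise method and scaleInLimit for the linear method, but check the documentation for your version.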
Configure the auto-scaler to use a more gradual scaling method: The auto-scaler has three scaling methods: stepwise, linear, and direct. The stepwise method is the most gradual, while the direct method is the most aggressive. You can configure the auto-scaler to use the stepwise method to scale in the system more slowly.
Optimize your workload: If your workload is causing the system to be under heavy load, you may be able to improve the scaling behavior by optimizing your queries or reducing the number of concurrent connections.
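
To make the effect of a smaller step concrete, here is a minimal sketch of how a stepwise scale-in visits intermediate sizes instead of dropping capacity in one go. It is an illustration only, not the auto-scaler's actual code: the class name, method name, and the numbers in main are hypothetical, and a real auto-scaler would also wait for a cooldown period between steps.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not the auto-scaler's actual implementation. */
class ScaleInPlan {

    /**
     * Returns the instance sizes visited when shrinking from current to target,
     * removing at most stepSize units of capacity per scaling event.
     */
    static List<Integer> stepwise(int current, int target, int stepSize) {
        if (stepSize <= 0) {
            throw new IllegalArgumentException("stepSize must be positive");
        }
        if (target > current) {
            throw new IllegalArgumentException("target must not exceed current");
        }
        List<Integer> sizes = new ArrayList<>();
        int size = current;
        while (size > target) {
            // Never overshoot the target on the last step.
            size = Math.max(target, size - stepSize);
            sizes.add(size);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: shrink from 5000 to 3000 units of capacity.
        System.out.println(stepwise(5000, 3000, 2000)); // prints [3000]
        System.out.println(stepwise(5000, 3000, 500));  // prints [4500, 4000, 3500, 3000]
    }
}
```

With the smaller step, each scaling event removes less capacity at once, which gives the instance time to settle between events.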

Additional tips:

Monitor the system: You should monitor the system when you are making changes to the auto-scaler configuration or scaling the system manually. This will help you to identify any problems early on.
Test the scaling behavior: You should test the scaling behavior of your system in a staging environment before making changes to the production environment. This will help you to identify and fix any problems before they impact your users.

Here are some specific steps you can take to troubleshoot the issue:

Check the Cloud Spanner logs to see what errors are occurring when the system scales in.
Use the Cloud Spanner monitoring console to see how the system is performing when it scales in.
Try increasing the scaling timeout.
Try reducing the scale-in ratio.
Try configuring the auto-scaler to use the stepwise scaling method.
Optimize your workload.
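
For the first two steps, one way to confirm that the errors really line up with scale-in events is to compare timestamps from both sources. The following is a minimal sketch, assuming you have already exported the error timestamps and the scale-in timestamps yourself. The class and method names are hypothetical, and the code does not call any Cloud Spanner or logging API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Illustrative helper for correlating errors with scale-in events. */
class ScaleInCorrelation {

    /**
     * Counts the errors whose timestamp falls within [scaleIn, scaleIn + window)
     * for at least one scale-in event.
     */
    static long errorsNearScaleIn(
            List<Instant> errors, List<Instant> scaleIns, Duration window) {
        return errors.stream()
            .filter(e -> scaleIns.stream().anyMatch(
                s -> !e.isBefore(s) && e.isBefore(s.plus(window))))
            .count();
    }
}
```

If most "deadline exceeded" errors fall inside a short window after a scale-in, the scale-in is the likely trigger. If they are spread evenly over time, the workload or the client timeouts deserve a closer look first.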

bluage_nakayama

Thank you @ms4446

"Try increasing the scaling timeout."

Could you please tell me the exact config name of "scaling timeout"? I couldn't find it.

ms4446

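Sorry for the confusion. Let me clarify. Neither Cloud Spanner nor the auto-scaler has a configuration parameter for a scaling timeout. What you can raise is the client-side RPC (remote procedure call) timeout, which is the deadline that produces the "deadline exceeded" error when a request takes longer than usual during a scale-in.

The client libraries do not expose a single maxRpcTimeoutMillis setting. In the Java client library, timeouts are configured per RPC through the retry settings on the SpannerOptions builder. Default timeouts differ by RPC and by library version, so check the values your version uses before changing them.

For example, to allow ExecuteSql calls to take up to 120 seconds, set the initial, maximum, and total timeouts for that RPC to 120 seconds.

Note: A longer timeout does not change how long the scale-in itself takes. It only gives requests that slow down during the scale-in more time to finish, at the cost of failing later when something is genuinely wrong.

Here is an example of how to set these timeouts in the Java Cloud Spanner client library. The project, instance, and database IDs are placeholders:

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import org.threeten.bp.Duration;

public class Main {
  public static void main(String[] args) {
    SpannerOptions.Builder builder = SpannerOptions.newBuilder();

    // Keep the default retry behavior for ExecuteSql and raise only the timeouts.
    // Newer library releases also offer java.time.Duration variants of these setters.
    builder.getSpannerStubSettingsBuilder()
      .executeSqlSettings()
      .retrySettings()
      .setInitialRpcTimeout(Duration.ofSeconds(120))
      .setMaxRpcTimeout(Duration.ofSeconds(120))
      .setTotalTimeout(Duration.ofSeconds(120));

    // Spanner is AutoCloseable; closing it releases sessions and channels.
    try (Spanner spanner = builder.build().getService()) {
      DatabaseClient client = spanner.getDatabaseClient(
        DatabaseId.of("my-project", "my-instance", "my-database"));

      // ... use the client ...
    }
  }
}

Apply the same pattern to whichever RPCs report the error, for example through commitSettings(). The other Cloud Spanner client libraries, such as Python and Go, also let you set request timeouts, although the API differs by language.

After raising the timeout, test the scaling behavior of your system to make sure that it works as expected.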
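
When a client retries an RPC, the per-attempt timeout and the overall deadline are usually separate settings, so raising only one of them may not change when "deadline exceeded" is reported. The sketch below is a simplified model of that interaction, not the client library's actual algorithm: it ignores the delay between retries and assumes every attempt uses its full timeout. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of per-attempt timeouts under an overall deadline. */
class AttemptTimeouts {

    /**
     * Returns the timeout given to each attempt: it starts at initialMillis,
     * grows by multiplier, is capped at maxRpcMillis, and never exceeds the
     * time remaining in the total budget.
     */
    static List<Long> schedule(
            long initialMillis, double multiplier, long maxRpcMillis, long totalMillis) {
        if (initialMillis <= 0 || maxRpcMillis <= 0 || multiplier < 1.0) {
            throw new IllegalArgumentException(
                "timeouts must be positive and multiplier must be at least 1");
        }
        List<Long> timeouts = new ArrayList<>();
        long remaining = totalMillis;
        double next = initialMillis;
        while (remaining > 0) {
            long timeout = Math.min(Math.min((long) next, maxRpcMillis), remaining);
            timeouts.add(timeout);
            remaining -= timeout;
            next *= multiplier;
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // Raising only the per-attempt cap changes nothing if the total stays at 60 s.
        System.out.println(schedule(60000, 1.0, 120000, 60000));  // prints [60000]
        // Raising the total as well lets a single attempt run for the full 120 s.
        System.out.println(schedule(120000, 1.0, 120000, 120000)); // prints [120000]
    }
}
```

In this model the total budget is the binding limit, which is why the per-attempt and total timeouts are best raised together.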