
Problem with accessibility to the GKE control plane

Hello!

We have a problem accessing our GKE cluster's control plane from outside the cluster. Sometimes (at least 10 times per 24 hours; see the first screenshot below, where 1 means that "kubectl get nodes" failed because of the default kubectl timeout)
image (17).png

we can't connect to this cluster via kubectl (not even from Cloud Shell). During these periods, kubectl requests to the GKE cluster fail with a timeout exceeded error, but the workloads running in the cluster are not affected: for example, we can get a response from the ingress controller in this cluster, but we can't get a response from the Kubernetes control plane. console.cloud.google.com also can't get information from the control plane at these times (see the screenshots below).
image (18).png
image (19).png
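For reference, a verbose run of the failing command can show where the request stalls before it reaches the control plane endpoint. This is only a sketch; -v=6 and --request-timeout are standard kubectl flags:

# Print each HTTP request kubectl makes, with a 15s client-side timeout,
# to see whether the connection to the control plane endpoint hangs.
kubectl get nodes -v=6 --request-timeout=15s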

More info about this GKE cluster:
Location type: zonal
Control plane zone: europe-west1-c
Control plane version: 1.24.10-gke.2300
Nodes zone: europe-west1-c
Nodes version: 1.24.10-gke.2300
Nodes image type: Ubuntu with containerd (ubuntu_containerd)

We also have other GKE clusters in this location, and they work fine.

What should we do to fix this problem?


Please indicate whether you are running a private cluster or a public cluster. If you are running a private cluster, do you have the public endpoint enabled? Are you running kubectl from a machine that is authorized to connect to the cluster? As you can see, this question needs quite a bit of information. Another possibility is that the GKE cluster was under maintenance when you tried to connect to the control plane.
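One way to check this is with gcloud. This is only a sketch: "my-cluster" is a hypothetical cluster name, and the fields shown are the private cluster and authorized networks settings of the cluster resource:

# Show the private cluster and master authorized networks configuration.
gcloud container clusters describe my-cluster \
  --zone europe-west1-c \
  --format="yaml(privateClusterConfig, masterAuthorizedNetworksConfig)"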

This cluster (like all our clusters) has "External endpoint: disabled". We use only the internal endpoint to access the control plane (access to it is implemented in our network infrastructure).



@munish1 wrote:

Are you running the kubectl from a machine that is authorized to connect to the cluster?


Yes. When the problem happens, we also can't connect to the cluster from Cloud Shell (because of a timeout exceeded error).


@munish1 wrote:

Another possibility is that the gke cluster was under maintenance when you tried to connect to the control plane. 


As I wrote before, we have this problem about 10 times per 24 hours. Each episode usually lasts about 20 minutes. If I understand you correctly, are you suggesting that a GKE cluster can be under maintenance that often?

`If I understand you correctly, are you suggesting that a GKE cluster can be under maintenance that often?` No, maintenance CANNOT happen that often.

Could you share a screenshot of your cluster configuration? Internal endpoints shouldn't be accessible to anyone outside the VPC, including Cloud Shell. Please share the cluster configuration.

 

 

Which part of the cluster configuration do you want to see? It's not a network issue. Usually kubectl can connect to this cluster (from Cloud Shell too), but when the problem happens it can't. Usually I can get information about the node pool in console.cloud.google.com, but when the problem happens I can't.
Please see the screenshots. When it's okay:

nnaumov_0-1681288281304.png

When it's not okay:
image (18).png

 

@nnaumov I would need your GCP project number and the name of the cluster to dig deeper. If you have access to open a support case, I would recommend you do so. If you don't, I would need the mentioned information to proceed further

Sorry, I can't open a support case. Could you make this issue private, please? I am not sure that information such as our GCP project number and other technical details isn't sensitive data.

Thank you for providing the cluster-related details. I have opened an internal bug with the product team. I will update this thread once I hear back.

@nnaumov Can you please provide a few specific timeslots when this happened? Please be specific, and include the timezone for the start and end times. This will help the engineering team.
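If it helps, a rough shell loop like the following can log the exact windows when the control plane stops answering. This is only a sketch; the interval and timeout values are arbitrary:

while true; do
  # Log a timestamped line every time the API server does not answer in time.
  if ! kubectl get nodes --request-timeout=15s >/dev/null 2>&1; then
    echo "$(date -u +%FT%TZ) control plane unreachable"
  fi
  sleep 60
done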

In the last 48 hours the problem happened at least 25 times, for example at:
12 April 2023: 21:16-21:29, 19:26-19:40, 18:18-18:32
11 April 2023: 13:40-15:00
UTC+3 time zone.


Thanks @nnaumov. Just a quick update:

The problem has been identified internally; deeper analysis is currently underway.

A config change was pushed a few hours ago. Please let us know if you see a difference in the timeouts.


The problem last happened at 06:00-06:20.

I marked the problematic timeouts on the screenshot: Screenshot 2023-04-13 at 11.57.50.png

UTC+3 time zone

Thank you. So the problem is still happening, but the frequency has gone down, is that correct?

Yes, that is correct. Last night the frequency went down. This problem usually happens at night.

@nnaumov We have pushed another config change. We see a dramatic improvement in the internal metrics. Please confirm if you are still seeing the timeouts.

Also, please list which specific kubectl commands are timing out, if any.

It works fine, thanks. We haven't had this problem for the last 3 days. What did you do to fix it?

That's awesome. The control plane was overloaded with API calls. We had to tweak some settings to re-route/balance the traffic to our backends.

Please "like" and accept the solution if this resolved the problem

1) Was the control plane overloaded with API calls from other GKE users with whom we share it?
2) The problem is still relevant, but not as frequent. It happened at 19:01-19:09 on 17 April and 09:56-10:12 on 18 April (UTC+3 timezone). Could you fix it? We have other GKE clusters in this region and their control planes work without any problems.

Somehow, I missed this update. 

1) No, the GKE control plane is a dedicated one. In this specific case, the API calls are generated from your cluster only. The more Kubernetes objects you have, such as secrets, pods, configmaps, etc., the more API calls you get. These API calls are made frequently by the internals to keep the cluster in sync.

2) Your other GKE clusters must be smaller in size, both in terms of compute capacity and the number of objects. Please confirm (a rough way to count the objects is sketched at the end of this reply).

Please monitor the timeouts and let us know the exact timeframes so we can continue to fine-tune the traffic.
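A rough way to gauge the object counts is below. These are standard kubectl commands, not an official sizing tool:

# Count objects per type across all namespaces.
kubectl get pods --all-namespaces --no-headers | wc -l
kubectl get secrets --all-namespaces --no-headers | wc -l
kubectl get configmaps --all-namespaces --no-headers | wc -l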

1) Could I know the API call limits in GKE? And how can I check the current API call count?
2) Yes, you are right. Our clusters that work fine are smaller. But if we clean up some useless objects now, we won't know whether the number of API requests has really gone down, so please answer the first question.

The last time the problem happened was on April 20 at 22:02-22:18 (UTC+3 timezone)

Hey @nnaumov, 1) Those are internal metrics that we can't share. 

The short answer to 2) is yes: the lower the number of overall objects, the better distributed the API calls will be.
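If you want a rough view of the request volume yourself, the Kubernetes API server exposes a standard Prometheus metrics endpoint. This is only a sketch; apiserver_request_total is a standard upstream metric, and access to /metrics depends on your RBAC permissions:

# Show a sample of API server request counters, broken down by verb,
# resource and response code.
kubectl get --raw /metrics | grep '^apiserver_request_total' | head -n 20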

 

@munish1, hello.
We had problems on 23 April 05:04-05:19 and today (24 April) 08:28-08:44. Could you fix it, please?
What can we do on our side, other than cleaning up objects, to fix this problem? Our cluster is not very big. I think that many of your clients have bigger clusters and they probably don't have this problem.


Hey @munish1!
What about my last questions?

Dear munish1,

I have the same problem. How do I open a case with you? This is impacting my production environment.
