Is there any reason why 'Connection reset by peer' logs are coming out in an App Engine service?

younghunyun · 01-30-2023 06:25 PM

Hello,

I found logs like the one below in my App Engine service A:
...The connection observed an error, the request cannot be retried as the headers/body were sent\nio.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

Service A is a Spring Boot application, and I suspected a problem related to the Netty connector.

To find out why the log occurs, I ran the same service in several different environments.

Service A was built from the same source code in every test; only the runtime environment differed. Across several tests, the 'Connection reset by peer' log appeared only in TEST CASE A, for a request sent 610 or 620 seconds after the last call to the LB domain.

I searched references and confirmed that the issue is resolved if I create the HttpClient or WebClient object with a provider that sets the maxIdleTime and evictInBackground options. However, I don't understand why the 'Connection reset' log occurs only when the service is deployed to App Engine. If Service A's logic is implemented in Python, TEST CASE A, B, C, and D all work normally without any 'Connection reset' logs.
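A minimal sketch of the provider-based fix described above, assuming Reactor Netty 1.0.x. The class name `IdleAwareClient`, the pool name `lb-pool`, and the 300-second and 60-second values are illustrative assumptions, not values from the original post; the only intent is to keep `maxIdleTime` below the LB's 600-second timeout so the client discards idle connections before the LB does.

```java
import java.time.Duration;

import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class IdleAwareClient {

    // Assumed value: anything comfortably below the LB's 600-second timeout.
    static final Duration MAX_IDLE = Duration.ofSeconds(300);
    // Assumed value: how often the pool scans for idle connections in the background.
    static final Duration EVICT_INTERVAL = Duration.ofSeconds(60);

    static HttpClient create() {
        ConnectionProvider provider = ConnectionProvider.builder("lb-pool") // hypothetical pool name
                .maxIdleTime(MAX_IDLE)             // close connections idle longer than this
                .evictInBackground(EVICT_INTERVAL) // evict periodically, not only when a connection is acquired
                .build();
        return HttpClient.create(provider);
    }

    public static void main(String[] args) {
        String body = create().get()
                .uri("https://lb.bespinlab.kr/api/main/ext?from=demo-provider-none")
                .responseContent()
                .aggregate()
                .asString()
                .block();
        System.out.println(body);
    }
}
```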

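The test code is simple:

```java
import reactor.netty.http.client.HttpClient;
// ...

// Default client: uses Reactor Netty's shared default connection pool,
// with no maxIdleTime configured.
HttpClient client = HttpClient.create();
return client.get()
             .uri("https://lb.bespinlab.kr/api/main/ext?from=demo-provider-none")
             .responseContent()   // response body as a stream of buffers
             .aggregate()         // join all chunks into one buffer
             .asString()          // decode the body as a String
             .block();            // wait for the result (test code only)
```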

The domain "lb.bespinlab.kr" points to the LB's static IP address.
The "/api/main/ext" API is served by App Engine service B.

I understand that the timeout of GCP's HTTPS Load Balancer is 600 seconds and cannot be changed, so the client that calls the LB should set its own timeout. But I find it strange that the same code works normally when run locally or on Cloud Run, and the error occurs only when it is deployed to App Engine. I looked for connection-related limits in the App Engine environment, but I could not find any documentation.

Of course, the problem can be solved by configuring the provider with these options. But to prevent similar problems in the future, is there any idea or approach for finding the root cause?

christianpaula

Hi @younghunyun,

The "Connection reset by peer" logs are likely caused by the 600-second timeout of the HTTPS Load Balancer in GCP. The client calling the Load Balancer should have its own idle timeout set lower than the Load Balancer's timeout to avoid this issue. Setting the maxIdleTime and evictInBackground options on the provider when creating the HttpClient or WebClient object resolves the issue because it gives the client such a timeout.
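The reasoning can be illustrated with a deliberately simplified model in plain Java. It is not Reactor Netty or GCP code: `IdleTimeoutModel` and `reuseHitsClosedSocket` are hypothetical names, and the model only compares three durations, ignoring background-eviction timing and races near the boundary.

```java
import java.time.Duration;

// Simplified model (illustration only): a pooled keep-alive connection
// is reused after sitting idle for `idle`.
public class IdleTimeoutModel {

    // Returns true if reusing the connection would hit a socket the LB has already closed.
    // clientMaxIdle == null means the client pool has no idle limit.
    static boolean reuseHitsClosedSocket(Duration idle, Duration clientMaxIdle, Duration lbIdleTimeout) {
        // The client reuses the connection only if its own pool has not evicted it yet.
        boolean clientStillReuses = clientMaxIdle == null || idle.compareTo(clientMaxIdle) < 0;
        // The LB drops the connection once the idle time reaches its timeout.
        boolean lbAlreadyClosed = idle.compareTo(lbIdleTimeout) >= 0;
        return clientStillReuses && lbAlreadyClosed;
    }

    public static void main(String[] args) {
        Duration lb = Duration.ofSeconds(600);
        // No client idle limit: a request after 610 s reuses a dead connection.
        System.out.println(reuseHitsClosedSocket(Duration.ofSeconds(610), null, lb));                    // true
        // Client idle limit below the LB timeout: the pool opens a fresh connection instead.
        System.out.println(reuseHitsClosedSocket(Duration.ofSeconds(610), Duration.ofSeconds(300), lb)); // false
    }
}
```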

To avoid similar issues in the future, set an idle timeout for the client in your Spring Boot application. Monitoring the Load Balancer's logs and performance metrics can also help detect other factors that may cause this issue.
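In a Spring Boot application, one way to apply this is to build the Reactor Netty `HttpClient` with a custom `ConnectionProvider` and pass it to `WebClient` through `ReactorClientHttpConnector`. This is a sketch assuming spring-webflux with Reactor Netty on the classpath; the class name `LbWebClientConfig`, the bean name `lbWebClient`, the pool name, and the 5-minute and 1-minute values are assumptions.

```java
import java.time.Duration;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;

import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

@Configuration
public class LbWebClientConfig {

    // Assumed values; the point is only that MAX_IDLE stays below the LB's 600-second timeout.
    static final Duration MAX_IDLE = Duration.ofMinutes(5);
    static final Duration EVICT_INTERVAL = Duration.ofMinutes(1);

    @Bean
    public WebClient lbWebClient() {
        ConnectionProvider provider = ConnectionProvider.builder("lb-client") // hypothetical pool name
                .maxIdleTime(MAX_IDLE)
                .evictInBackground(EVICT_INTERVAL)
                .build();
        HttpClient httpClient = HttpClient.create(provider);
        // Plug the customized Reactor Netty client into Spring's WebClient.
        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .baseUrl("https://lb.bespinlab.kr")
                .build();
    }
}
```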

Documentation references:

GCP App Engine documentation: https://cloud.google.com/appengine/docs
GCP HTTPS Load Balancer documentation: https://cloud.google.com/load-balancing/docs/https/
GCP Load Balancer logging and monitoring: https://cloud.google.com/load-balancing/docs/logging-monitoring/
GCP App Engine quotas and limits: https://cloud.google.com/appengine/quotas
GCP App Engine troubleshooting guide: https://cloud.google.com/appengine/docs/troubleshooting

Thank you

younghunyun

Thanks for your answer.

As you explained, I understand that the log is caused by the HTTPS LB resetting the client's connection because of its 600-second timeout. That's why I ran test cases to verify it.

If the HTTPS LB timeout were clearly the cause, the same log should also have appeared in test cases B, C, and D, but it appeared only in A. Test cases A, B, C, and D differ only in the environment that runs the JAR built from the same source. The source code is simple and does not use a provider. That's why I want to know whether running with java/docker on a local PC differs from deploying to App Engine.

- Test Case A: Service deployed to App Engine → HTTPS LB
- Test Case B: "java -jar APP.jar" on a local PC → HTTPS LB
- Test Case C: "docker run BUILD_IMAGE_FROM_JAR" on a local PC → HTTPS LB
- Test Case D: Service deployed to Cloud Run → HTTPS LB

Test results:

| Delay after previous LB call | A (App Engine) | B (local java) | C (local docker) | D (Cloud Run) |
|---|---|---|---|---|
| 0 sec | 200 OK | 200 OK | 200 OK | 200 OK |
| 60 sec | 200 OK | 200 OK | 200 OK | 200 OK |
| 300 sec | 200 OK | 200 OK | 200 OK | 200 OK |
| 600 sec | 200 OK | 200 OK | 200 OK | 200 OK |
| 610/620 sec | 500 Error | 200 OK | 200 OK | 200 OK |
| 1200 sec | - | 200 OK | 200 OK | 200 OK |
| 6000 sec | - | 200 OK | 200 OK | 200 OK |
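To reproduce one row of the table in a given environment, a small probe can make two calls separated by a delay and print the Netty channel id for each request: if the second call shows the same id, the pooled connection was reused after the idle period. This is a hypothetical diagnostic sketch (the class and method names are illustrative) that uses the same default `HttpClient.create()` as the test code.

```java
import java.time.Duration;

import reactor.netty.http.client.HttpClient;

public class IdleReuseProbe {

    static final String URL = "https://lb.bespinlab.kr/api/main/ext?from=demo-provider-none";

    // Delay between the two calls in seconds; defaults to 610 (the failing row for case A).
    static long delaySeconds(String[] args) {
        return args.length > 0 ? Long.parseLong(args[0]) : 610;
    }

    // Sends one GET, drains the body, and returns the HTTP status code.
    static int call(HttpClient client) {
        return client.get()
                .uri(URL)
                .responseSingle((res, body) -> body.asString()
                        .defaultIfEmpty("")
                        .map(s -> res.status().code()))
                .block();
    }

    public static void main(String[] args) throws InterruptedException {
        long delay = delaySeconds(args);
        HttpClient client = HttpClient.create()
                // Same channel id on both requests means the pooled connection was reused.
                .doOnRequest((req, conn) -> System.out.println("request on channel " + conn.channel().id()));

        System.out.println("first call: " + call(client));
        Thread.sleep(Duration.ofSeconds(delay).toMillis());
        System.out.println("second call after " + delay + " s: " + call(client));
    }
}
```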

If I could find documentation of a connection setting in App Engine that differs from the local environment, the cause would be clear, but I could not find any. Is there anything else I can check about the App Engine environment?

svazquezpetrini

We have the same error. 😞 Any suggestions?

younghunyun

Hi,
In my case, the 'Connection reset by peer' issue was related to the timeout settings of the client and the GCP resources.

With Reactor Netty 1.0.10 and Spring Boot 2.5.4, when a request occurs after 10 minutes of idle time, the existing channel is disconnected and a new channel is created. We fixed the service by adjusting the connection settings.

Note that GCP's HTTP LB timeout and keepalive timeout values are 10 minutes and cannot be changed.

When the LB terminates the connection, the client's connection is also terminated, and a request sent at that moment fails. So we set the client's maximum idle interval to less than 10 minutes, which is the LB's idle timeout.
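Because the original test code uses `HttpClient.create()` without a provider, it relies on Reactor Netty's global default pool. As far as I know, Reactor Netty 1.0.x lets you set that pool's idle limit with the `reactor.netty.pool.maxIdleTime` system property (in milliseconds); please verify this against the documentation for your version. A sketch, where the class and method names and the 5-minute value are assumptions:

```java
import reactor.netty.http.client.HttpClient;

public class GlobalPoolIdleSetting {

    // Must run before any Reactor Netty class is initialized, because the default
    // pool reads this property once. Equivalent JVM flag:
    //   -Dreactor.netty.pool.maxIdleTime=300000
    // 300000 ms = 5 minutes, an assumed value below the LB's 10-minute idle timeout.
    static void configureGlobalPool() {
        System.setProperty("reactor.netty.pool.maxIdleTime", "300000");
    }

    public static void main(String[] args) {
        configureGlobalPool();
        // Unchanged test code: HttpClient.create() uses the global default pool,
        // which should now close connections idle for more than 5 minutes.
        String body = HttpClient.create().get()
                .uri("https://lb.bespinlab.kr/api/main/ext?from=demo-provider-none")
                .responseContent()
                .aggregate()
                .asString()
                .block();
        System.out.println(body);
    }
}
```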

Depending on the client configuration or environment, this may not be the right answer for you. Set up an environment you can test in and try the settings under different conditions.
