Hi,
I’m working on implementing a retry mechanism in Apigee using two different target servers. When the response is 200, the requests are distributed between the target servers using a round-robin algorithm, as expected.
However, when the response code is in the 5XX range, the retry does not occur, and I encounter the error below. Can anyone help me understand why the retry isn't working in this case?
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<TargetEndpoint name="default">
  <Description/>
  <FaultRules/>
  <PreFlow name="PreFlow">
    <Request/>
    <Response/>
  </PreFlow>
  <PostFlow name="PostFlow">
    <Request/>
    <Response>
      <Step>
        <Name>AM-store-targetInfo</Name>
      </Step>
    </Response>
  </PostFlow>
  <Flows/>
  <HTTPTargetConnection>
    <Properties/>
    <LoadBalancer>
      <Algorithm>RoundRobin</Algorithm>
      <Server name="TestTargetServer1"/>
      <Server name="TestTargetServer2"/>
      <MaxFailures>3</MaxFailures>
      <RetryEnabled>true</RetryEnabled>
      <ServerUnhealthyResponse>
        <ResponseCode>404</ResponseCode>
        <ResponseCode>500</ResponseCode>
        <ResponseCode>502</ResponseCode>
        <ResponseCode>503</ResponseCode>
      </ServerUnhealthyResponse>
    </LoadBalancer>
    <Path/>
  </HTTPTargetConnection>
</TargetEndpoint>
@learnapigee_ca - that's because Apigee treats the error response as a failure and takes the server out of rotation. This page has all the info you need. As the doc suggests, add a HealthMonitor so that Apigee can probe the health check endpoint, bring the server back into rotation, and resume load balancing between the two.
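To make the HealthMonitor suggestion concrete, here is a minimal sketch of how it could sit next to the LoadBalancer from the question. The `/status/200` health check path, the 5-second interval, and the timeout values are assumptions chosen for illustration, not values from the original post; check the HealthMonitor reference for the TLS-related options your target needs.

```xml
<HTTPTargetConnection>
  <LoadBalancer>
    <Algorithm>RoundRobin</Algorithm>
    <Server name="TestTargetServer1"/>
    <Server name="TestTargetServer2"/>
    <MaxFailures>3</MaxFailures>
    <RetryEnabled>true</RetryEnabled>
    <ServerUnhealthyResponse>
      <ResponseCode>500</ResponseCode>
      <ResponseCode>502</ResponseCode>
      <ResponseCode>503</ResponseCode>
    </ServerUnhealthyResponse>
  </LoadBalancer>
  <Path/>
  <!-- HealthMonitor is a sibling of LoadBalancer, not a child of it -->
  <HealthMonitor>
    <IsEnabled>true</IsEnabled>
    <!-- Assumed probe interval; tune for your backend -->
    <IntervalInSec>5</IntervalInSec>
    <HTTPMonitor>
      <Request>
        <ConnectTimeoutInSec>10</ConnectTimeoutInSec>
        <SocketReadTimeoutInSec>30</SocketReadTimeoutInSec>
        <Verb>GET</Verb>
        <!-- Hypothetical health check path -->
        <Path>/status/200</Path>
      </Request>
      <SuccessResponse>
        <ResponseCode>200</ResponseCode>
      </SuccessResponse>
    </HTTPMonitor>
  </HealthMonitor>
</HTTPTargetConnection>
```

Without a HealthMonitor, a server that reaches MaxFailures stays out of rotation until the proxy is redeployed, which can look like "retry stopped working" during a test.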
Yes, I’ve reviewed the provided page. I’m currently working on a POC to validate if retry logic is functioning as expected. Below is the target endpoint configuration file I’m using.
For testing, I’ve configured the target servers with https://postman-echo.com/status/200 to observe if the round-robin algorithm rotates between the defined target servers. Similarly, I’m testing how it behaves when the response status is in the 5XX range. However, the results aren’t as expected.
It would be helpful if you could shed some light on this behavior.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<TargetEndpoint name="default">
  <Description/>
  <FaultRules/>
  <PreFlow name="PreFlow">
    <Request/>
    <Response/>
  </PreFlow>
  <PostFlow name="PostFlow">
    <Request/>
    <Response>
      <Step>
        <Name>AM-store-targetInfo</Name>
      </Step>
    </Response>
  </PostFlow>
  <Flows/>
  <HTTPTargetConnection>
    <LoadBalancer>
      <Algorithm>RoundRobin</Algorithm>
      <Server name="TestTargetServer1"/>
      <Server name="TestTargetServer2"/>
      <MaxFailures>1</MaxFailures>
      <RetryEnabled>true</RetryEnabled>
      <RetryPolicy>
        <MaxRetries>2</MaxRetries>
        <!-- Number of retries for 5xx errors -->
        <ResponseCode>500</ResponseCode>
        <ResponseCode>502</ResponseCode>
        <ResponseCode>503</ResponseCode>
      </RetryPolicy>
      <ServerUnhealthyResponse>
        <ResponseCode>502</ResponseCode>
        <ResponseCode>503</ResponseCode>
      </ServerUnhealthyResponse>
    </LoadBalancer>
    <Path/>
  </HTTPTargetConnection>
</TargetEndpoint>
However, the results aren’t as expected.
I don't know specifically how you tested, or how the results you're seeing differ from what you're expecting, but this screencast from a couple of years ago walks through Target Servers and health monitoring in Apigee.
Maybe it will help.
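For reference when reproducing the test, here is a rough sketch of what a TargetServer definition for the postman-echo setup might look like. A TargetServer entry carries only host, port, and TLS settings; the request path (for example `/status/200` or `/status/503`) would normally go in the `<Path>` element of the HTTPTargetConnection, which is empty in the configs posted above. The host, port, and SSLInfo values below are assumptions based on the URL mentioned in the thread, not the poster's actual definition.

```xml
<!-- Hypothetical definition; the real TestTargetServer1 was not shown in the thread -->
<TargetServer name="TestTargetServer1">
  <Host>postman-echo.com</Host>
  <Port>443</Port>
  <IsEnabled>true</IsEnabled>
  <SSLInfo>
    <Enabled>true</Enabled>
  </SSLInfo>
</TargetServer>
```

As far as I know, `<RetryPolicy>` and `<MaxRetries>` are not among the documented LoadBalancer child elements, so that block in the second config may have no effect. The documented retry behavior is driven by `<RetryEnabled>` together with the codes listed in `<ServerUnhealthyResponse>`. In that config, a 500 from the backend is not listed as unhealthy, so it would likely be returned to the client without a retry.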