
Enable Apigee Service Callout retries

I have an Apigee ServiceCallout that runs inside a shared flow. I am receiving occasional (1 in 1000) EOF errors in that callout. My best guess is that the target server has stale connections, and a retry would likely work on the second call. I have created a second callout that runs if the first one fails, and it is working; however, I cannot find a way of logging the reason why the first call fails. I am looking for advice:

  • Is there a way of making a ServiceCallout and capturing connection error conditions? Would JavaScript enable more error handling?
  • Can I change the TargetServer or Connection properties to maybe alleviate the root cause of the EOF?
  • Can an enhancement be made to Apigee so that "continueOnError=true" would capture the "fault" information, the way the Error flow does?

Here is an example of the error that occurs, which I managed to capture in a trace:

id": "Error",
          "results": [
            {
              "actionResult": "DebugInfo",
              "accessList": [],
              "timestamp": "21-02-25 16:43:20:355",
              "properties": {
                "properties": [
                  {
                    "name": "error.cause",
                    "value": "eof unexpected",
                    "rowID": "___row188"
                  },
                  {
                    "name": "error.class",
                    "value": "com.apigee.kernel.exceptions.spi.UncheckedException",
                    "rowID": "___row189"
                  },
                  {
                    "name": "state",
                    "value": "PROXY_REQ_FLOW",
                    "rowID": "___row190"
                  },
                  {
                    "name": "type",
                    "value": "ErrorPoint",
                    "rowID": "___row191"
                  },
                  {
                    "name": "error",
                    "value": "Execution of ServiceCallout scp_callExternalService failed. Reason: eof unexpected",
                    "rowID": "___row192"
                  }
...

 

In order to make the second call, I set "continueOnError" to true so that the Error flow is not started (if it were, no retry could be done). However, the only flow variable set in that case is "servicecallout.xxx.failed" = true. The "fault" and "error" variables are only available in the Error flow, so I just have the equivalent of the ambiguous "something has gone wrong".

Here are the simple prototype policies I am working with:

...
<Step>
  <Name>scp_callExternalService</Name>
</Step>
<Step>
  <Name>jsp-processServiceResponse1</Name>
</Step>
<Step>
  <Name>scp_callExternalService</Name>
  <Condition>servicecallout.scp_callExternalService.failed == true</Condition>
</Step>
<Step>
  <Name>jsp-processServiceResponse2</Name>
</Step>
...

 


Here is the policy config:

 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ServiceCallout async="false" continueOnError="true" enabled="true" name="scp_callExternalService">
  <DisplayName>scp_callExternalService</DisplayName>
  <Request variable="serviceRequest">
    <Set>
      <Verb>POST</Verb>
      <Path>/access-management/v1/authorization</Path>
    </Set>
    <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
  </Request>
  <Response>serviceResponse</Response>
  <Timeout>30000</Timeout>
  <HTTPTargetConnection>
    <Properties>
      <Property name="success.codes">1xx, 2xx, 3xx, 4xx</Property>
    </Properties>
    <LoadBalancer>
      <Algorithm>RoundRobin</Algorithm>
      <Server name="rest-cluster-a"/>
      <Server name="rest-cluster-b"/>
      <MaxFailures>5</MaxFailures>
      <ServerUnhealthyResponse>
        <ResponseCode>500</ResponseCode>
        <ResponseCode>502</ResponseCode>
        <ResponseCode>503</ResponseCode>
      </ServerUnhealthyResponse>
      <RetryEnabled>true</RetryEnabled>
    </LoadBalancer>
    <HealthMonitor>
      <IsEnabled>true</IsEnabled>
      <TCPMonitor>
        <ConnectTimeoutInSec>10</ConnectTimeoutInSec>
      </TCPMonitor>
      <IntervalInSec>60</IntervalInSec>
    </HealthMonitor>
  </HTTPTargetConnection>
</ServiceCallout>

 

I have a health check in place, and I believe the default for keep-alives is 60 seconds, which should keep any connection-pool connections alive.
The unhealthy-server response codes and success codes are not relevant for this error, since it is happening on the connection with the target server, not in a response from the target server. I have referenced https://cloud.google.com/apigee/docs/api-platform/deploy/load-balancing-across-backend-servers#setti... 
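
For the second bullet above, here is the kind of connection-property experiment I am considering. This is only a sketch: I am assuming the standard endpoint transport properties (keepalive.timeout.millis, connect.timeout.millis, io.timeout.millis) also apply inside a ServiceCallout's HTTPTargetConnection, and the values are guesses rather than confirmed fixes for the EOF.

<HTTPTargetConnection>
  <Properties>
    <!-- existing property from the policy above -->
    <Property name="success.codes">1xx, 2xx, 3xx, 4xx</Property>
    <!-- assumption: recycle pooled connections sooner than the 60-second default,
         in case the backend is silently dropping idle connections -->
    <Property name="keepalive.timeout.millis">30000</Property>
    <!-- assumption: explicit connect/io timeouts, mainly to rule them out -->
    <Property name="connect.timeout.millis">3000</Property>
    <Property name="io.timeout.millis">30000</Property>
  </Properties>
  ...
</HTTPTargetConnection>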
1 REPLY

I don't have a good answer for you here. As you can see, when there is a non-response from the backend, you must resort to somewhat unnatural acts, or at least unusual acts, within Apigee in order to retry it from there. What I would suggest is: push the responsibility for retry outside of your Apigee proxy.

  • create a wrapper service that is a passthrough-with-retry. Envoyproxy can do this; it has configuration for retries.  Or you could write your own.  If I were doing it, I'd use envoyproxy.  Then point Apigee to the envoyproxy facade. The retry happens from there.
  • push the retry responsibility to the client.
  • Wrap the sometimes-failing service with a different API proxy that actually handles the fault and exposes it to your calling proxy (see the sketch just after this list). This kind of reminds me of duct tape, but it might serve your needs.
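
For that last option, here is a rough sketch of what I mean. Everything here is illustrative (the policy name, the servers, the path), and I haven't verified it end-to-end; the idea is just that the wrapper proxy catches the target fault and converts it into a normal HTTP response carrying the fault details, so your calling proxy's ServiceCallout receives a status code and a payload it can log and branch on, instead of a dropped connection.

<!-- TargetEndpoint of the hypothetical wrapper proxy -->
<TargetEndpoint name="default">
  <DefaultFaultRule name="expose-target-fault">
    <AlwaysEnforce>true</AlwaysEnforce>
    <Step>
      <!-- AssignMessage policy (below) that copies the fault details into the response -->
      <Name>AM-ExposeFault</Name>
    </Step>
  </DefaultFaultRule>
  <HTTPTargetConnection>
    <LoadBalancer>
      <Server name="rest-cluster-a"/>
      <Server name="rest-cluster-b"/>
    </LoadBalancer>
    <Path>/access-management/v1/authorization</Path>
  </HTTPTargetConnection>
</TargetEndpoint>

<!-- AM-ExposeFault: in the fault flow, build the response the wrapper returns to its caller -->
<AssignMessage async="false" continueOnError="false" enabled="true" name="AM-ExposeFault">
  <Set>
    <StatusCode>502</StatusCode>
    <Payload contentType="application/json">{"fault":"{fault.name}","error":"{error.message}"}</Payload>
  </Set>
  <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
</AssignMessage>

Your calling proxy's ServiceCallout would then target this wrapper rather than the backend directly.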

Separately, I would recommend:

  • diagnosing and eliminating the transient error, to the extent you can. (duh)
  • Raise a feature enhancement request with the Apigee team (via the Apigee support desk): the HTTPTargetConnection should be able to perform a retry, independent of the Load Balancer, kinda like what Envoyproxy does.
  • Also maybe raise a feature enhancement request regarding the setting of variables or diagnostics on the error condition when a ServiceCallout fails in the way you described, with continueOnError = true. (Forget it; I just raised an enhancement request for this, because I think it's the right thing for Apigee to do. Internal ref b/399273045. Contact Apigee support if you want to track this request. There's no telling if it will get prioritized.)