Solved: Re: Revision is deployed and traffic can flow, but...

08-27-2015 11:00 PM

As the title states. I have a free account.

sudheendras

Can you please share your Org and proxy names?

Hi there, why would you need those? Even the Yahoo Weather API proxy from the tutorial raises the issue just by being deployed. This is the message I got when I checked today: The revision is deployed and traffic can flow, but flow may be impaired. com.apigee.kernel.exceptions.spi.UncheckedException{ code = application.bootstrap.FailedToConfigure, message = Configuration failed, associated contexts = []}; and Call timed out

Note the part where there is an unchecked exception.
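
For anyone collecting these messages from logs or the UI, here is a minimal sketch of pulling the exception class, code, and message out of the text quoted above. The field layout is inferred from that single sample only, and `parse_deploy_error` is a hypothetical helper name, not part of any Apigee tooling.

```python
import re

# Pattern for the exception text quoted in the post above. The layout is
# inferred from that one sample, not from any Apigee specification.
_ERROR_RE = re.compile(
    r"(?P<cls>[\w.]+)\{\s*code\s*=\s*(?P<code>[\w.]+)\s*,"
    r"\s*message\s*=\s*(?P<message>.*?)\s*,"
    r"\s*associated contexts\s*=\s*\[(?P<contexts>.*?)\]\s*\}"
)


def parse_deploy_error(text):
    """Return exception class, code and message as a dict, or None if absent."""
    match = _ERROR_RE.search(text)
    if match is None:
        return None
    return {
        "exception": match.group("cls"),
        "code": match.group("code"),
        "message": match.group("message"),
    }


sample = (
    "The revision is deployed and traffic can flow, but flow may be impaired. "
    "com.apigee.kernel.exceptions.spi.UncheckedException{ code = "
    "application.bootstrap.FailedToConfigure, message = Configuration failed, "
    "associated contexts = []}; and Call timed out"
)
parsed = parse_deploy_error(sample)
print(parsed["code"], "/", parsed["message"])
```

Grouping failures by the extracted code makes it easier to tell a configuration failure like this one apart from other deployment warnings.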

sudheendras

It appears to be a problem with an underlying infrastructure component. Knowing the org name would help us identify the failing component.

It's a known issue and the engineering team is fixing it. It should be resolved very soon.

Oh I see, thanks for the info. The org name is "hanzelgarcia", and all proxies are affected.

adevegowda

Dear @Hanzel Garcia,

We had a problem with our infrastructure that caused the issues with deploying and tracing proxies. This problem is now resolved. We tested in a few of our own orgs and confirmed that we can now deploy and trace proxies without any issues.

Can you please try to deploy, undeploy, and trace the proxies in your org now?

If the problem persists, please do let us know.

Sincere apologies for the inconvenience caused.

Hi there, thanks for the information. I tested about 3 minutes ago and confirmed that the fix is working splendidly.

adevegowda

Dear @Hanzel Garcia,

Thanks for confirming that it is working fine now.

We see this intermittently on our on-prem installation (4.15.07.00). Is it possible to characterize what the problem was "with our infrastructure"?

I am very unhappy with being required to take remedial action on this issue - and only being able to discover it by having a proxy deployment fail.

Hopefully Apigee can come up with a better way to trap and fix this problem before users stumble on it and have to intervene to resolve it. Also: getting a better network is not a solution :)

I couldn't agree more. We did a major proxy push last week and ended up with non-deterministic success and failure. Some proxies deployed just fine. Others failed. I then got into a state where I would undeploy a proxy only to be told that it wasn't deployed. Then I would attempt to deploy the proxy only to be told that it was already deployed.

I had to ask our operations team to restart the whole cluster over the weekend. This isn't exactly confidence-inspiring in a system that touts eventual consistency and 12-factor apps.

So I am 100% sure that some defensive coding could resolve this problem without a lot of trouble. To make this a priority, though, people need to call their product manager and tell them this is a problem they want some attention paid to.

I don't even bother to open tickets to demonstrate demand anymore.
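
As one illustration of what "defensive coding" could mean here, below is a minimal sketch of retrying a flaky call with exponential backoff. It is a generic technique, not Apigee's implementation; `call_with_retry` and `flaky_push` are hypothetical names, and the choice of `ConnectionError` and `TimeoutError` as the retryable failures is an assumption.

```python
import time


def call_with_retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying transient network errors with exponential backoff.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for a push to one node: fails twice, then succeeds.
calls = {"n": 0}


def flaky_push():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("Call timed out")
    return "deployed"


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_push, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. A real system would also need the retried operation to be idempotent, which is exactly what the deploy/undeploy confusion described above suggests was missing.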

So we have this issue from time to time. Some things inside Apigee are not very forgiving of real-world network scenarios. It's worse when you are deploying to multiple datacenters from one management server (like we do in all of our Private Cloud installations...)

Here are the scenarios that have hit us to date:

1) message processor or router is somehow not able to talk to the database. (seems to have been generally resolved since 15.05) - resolved by restarting the MP or Router

2) management service is somehow not able to talk to all of the message processors and routers in the same transaction cycle (just had this problem this morning) - looks like a MP or Router problem but is only resolved by restarting the offending Management Service.

3) management service is unable to talk to one specific message processor or router in a transaction cycle (had this problem last in 15.07) - must restart Management Service.

4) management service is unable to open an rpc connection out (not seen on the network layer as an error, only w/in the software stack) (had this problem last in 15.07) - must restart Management Service.

5) message processor not able to correctly talk to zookeeper (this seems to have been generally resolved since 15.07) - Must restart MP or Router.

I have a series of tickets open with support and we are working through these problems one at a time. It is made more difficult by the tendency to blame "the network". While network conditions may in fact be a problem - they will ALWAYS be a problem and complex systems must account for this.

I'll leave it up to the SW design experts to fuss over how - while I push the whys on them...
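
The five scenarios above can be condensed into a small lookup for on-call triage. This is only a sketch transcribing that list; the dictionary keys and the function name are hypothetical labels chosen for illustration.

```python
# Scenario label -> (symptom, component to restart), transcribed from the
# numbered list above. The labels themselves are made up for this sketch.
RESTART_GUIDE = {
    "mp_or_router_db": (
        "MP or Router cannot talk to the database",
        "MP or Router",
    ),
    "ms_all_nodes": (
        "Management Service cannot reach all MPs and Routers in one transaction cycle",
        "Management Service",
    ),
    "ms_one_node": (
        "Management Service cannot reach one specific MP or Router",
        "Management Service",
    ),
    "ms_rpc_out": (
        "Management Service cannot open an outbound RPC connection",
        "Management Service",
    ),
    "mp_zookeeper": (
        "MP cannot correctly talk to ZooKeeper",
        "MP or Router",
    ),
}


def component_to_restart(scenario):
    """Return the component the list above says to restart for a scenario."""
    _symptom, component = RESTART_GUIDE[scenario]
    return component


for label, (symptom, component) in RESTART_GUIDE.items():
    print(f"{label}: {symptom} -> restart {component}")
```

Note that scenario 2 looks like an MP or Router problem but, per the list, is only resolved by restarting the Management Service.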

Wow, probably one of the best answers I've ever read here! Very much appreciated.

I feel that people do not complain about these issues enough, as I get a lot of pushback about resolving them...

adas

@Benjamin Goldman You have pretty much captured all the failure scenarios that affect API proxy deployments. There are a few more, such as Cassandra replication delay between regions, which causes failures in finding a resource file or something else referenced in the proxy bundle. As you rightly said, there are multiple issues around proxy deployment and we are systematically working through them. We tend to fix these first for our cloud and bake them into the platform before pushing them out as patches to on-premise customers; that has been the approach all this while. This results in the fixes being delayed for "OPDK", aka "private cloud", and I can totally understand the frustration that may cause.

While we are working through the deployment issues, there are still a few out there that need to be addressed incrementally. In parallel, we are also trying to redesign this piece to account for all the failure cases and network issues that you just mentioned. This will take a while to implement, roll out, and stabilize on our cloud before the changes can be released as patches to our private cloud customers. In general you should see a lot of these being addressed in 4.16.01.xx and 4.16.05.00 (yet to be released). As always, thanks for your great insights, for bringing up these critical issues, and for sharing your point of view.

@Benjamin Goldman @arghya das - Do either of you know if any or all of these issues have been resolved in the 4.16.01 or 4.16.05 releases?

gauravjain_59

Hi

We are also facing this issue on a 4.17.0.1 Private Cloud 9-node cluster.

Revision is deployed and traffic can flow, but flow may be impaired. Could not locate a resource.

The resource is present in the org, but the deployment still fails with the above error. Undeploying and redeploying doesn't solve the issue either.

Attaching the screenshot for the same.

gateway-proxy-error.jpg

krupalpatelusa

The issue is still present in 4.18.05.00.

We get the "Revision is deployed and traffic can flow, but flow may be impaired" error, and undeploying and redeploying doesn't solve the issue.