
KVM and related cache degradation

Hi all,

I am facing intermittent staleness when using the KeyValueMapOperations (KVMops) policy to put and retrieve values from a KVM (and its cache). I am using Apigee hybrid.

I use the KVM to save a sessionId when a user performs a successful log-in (a JWT is generated containing the sessionId). When the user later tries to invoke a backend service, I retrieve the stored value and match it against the one contained in the JWT. This is a simple way to allow a user only one live session: a new session "steals" the old one.
Basically, I have this KVM operation in the API proxy's response PostFlow, after the user has successfully logged in, to save the sessionId into the KVM:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<KeyValueMapOperations enabled="true" name="KVMop-postSessionID" mapIdentifier="sessionCache" async="false" continueOnError="false">
    <DisplayName>KVMop-postSessionID</DisplayName>
    <Properties/>
    <ExclusiveCache>false</ExclusiveCache>
    <!-- Cached KVM values are considered fresh for 6 hours -->
    <ExpiryTimeInSecs>21600</ExpiryTimeInSecs>
    <!-- Store the new sessionId (flow variable "code") under the username key -->
    <Put override="true">
        <Key>
            <Parameter ref="username"/>
        </Key>
        <Value ref="code"/>
    </Put>
    <Scope>environment</Scope>
</KeyValueMapOperations>

And when the user tries to call a service, in the request PreFlow I retrieve the sessionId from the KVM (using the policy below) and then perform the check I mentioned above:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<KeyValueMapOperations continueOnError="false" enabled="true" name="KVMop-getSessionID" mapIdentifier="sessionCache">
    <DisplayName>KVMop-getSessionID</DisplayName>
    <Properties/>
    <!-- Read the stored sessionId for the username taken from the JWT
         into the flow variable "sessionID" -->
    <Get assignTo="sessionID">
        <Key>
            <Parameter ref="jwt.username"/>
        </Key>
    </Get>
</KeyValueMapOperations>
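
The check itself is wired as a conditional step right after this policy, roughly like the following (simplified; jwt.sessionId is the claim extracted from the JWT and RF-SessionStolen is the RaiseFault policy that rejects the call):

<PreFlow name="PreFlow">
    <Request>
        <Step>
            <Name>KVMop-getSessionID</Name>
        </Step>
        <!-- Reject the call when the sessionId in the JWT no longer matches the
             latest value stored in the KVM, i.e. a newer login "stole" the session -->
        <Step>
            <Name>RF-SessionStolen</Name>
            <Condition>sessionID != jwt.sessionId</Condition>
        </Step>
    </Request>
</PreFlow>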

The problem: sometimes users fail the "JwtSessionId == KvmSessionId" check even when they legitimately call a service right after a login. What I see in a debug session is that the put KVM operation runs and the JWT is written, but when the get KVM operation retrieves the sessionId for that user, it returns an old value instead of the fresh one. It is as if there were some latency in writing the sessionId, or as if the put operation reported success without the value actually being written to the KVM.

The KVM storing the sessionIds is not small; it holds around 2,000 entries. But even purging it did not stop the sporadic errors.
Another strange thing: after the login, some services are called in parallel, and some of them fail the check while others pass. It looks as if there were an underlying KVM (or KVM cache) layer with some structures working and others not, but that is just an assumption; I don't know the Apigee architecture very well.

I have also found this on https://cloud.google.com/apigee/docs/release/known-issues:

[screenshot of the relevant known issue]

@dchiesa1 @anilsagar, do you have any clues or hints?

Many thanks to anyone who can help analyze my problem 🙂

1 ACCEPTED SOLUTION

Hi Perry,

Thanks for reaching out about this. Session-ID management can have some complexities.

First, I recommend looking at the Cache policies in place of the KVM policies. Whether you choose to change will depend on your requirements. The reason I mention this is that cached items (with expiration times) are automatically cleaned up from storage, while a KVM entry exists until it is deleted. I don't know from the description whether the usernames change or grow in number frequently; I see the reference to 2k entries, so this may not yet be a concern. If they do, switching may be worth it to help reduce storage bloat.

Additionally, changing to a cache-based model allows the session to time out in the database as well as in the JWT, adding a bit more security to the system. The cache's expiration can be set to the time at which the JWT expires.

Regarding your specific questions, however, I have a few thoughts that may apply to the situation you describe.

  1. The KVM cache expiration is set to 6 hours. This means that each processing pod (Message Processor) within Apigee reads the KVM and caches its state, at read time, for 6 hours. The cache timer starts at the point the value was read into that specific pod, so expiry can be staggered across pods as calls come in.
  2. In Apigee hybrid there are typically a minimum of 2 processing pods, but there can be many more depending on load and multi-regional configuration.
  3. A KVM Put resets the cache, but only on the processing pod that receives the Put. See PUT under ExpiryTimeInSecs: https://cloud.google.com/apigee/docs/api-platform/reference/policies/key-value-map-operations-policy...

From what has been described, I think you are running into #3. The user may be re-logging in, or something else is causing a new JWT to be created. When the new JWT is created, the pod that processes it will know about the new sessionId; however, any other pod that still has the old value in its cache will not know there is a new one until its cache expires.

So there are a couple of approaches that could help.

  1. Reduce the KVM's ExpiryTimeInSecs (see the first sketch after this list). I cannot say what value to use; you will need to consider typical session lengths, the frequency of logout and login sequences (and whether you clear the sessionId from the KVM on logout), and performance (if set very low, the system will have to read from storage often).
    Note: it is important to point out that the KVM entry itself does not expire; it stays in the system. So even when the cache expires, a pod can still read the latest value from the KVM, essentially forever, unless you delete it at some point.
  2. Take a look at using the Cache policies: https://cloud.google.com/apigee/docs/api-platform/reference/policies/populate-cache-policy. This may give you more flexibility (see the second sketch below).
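
As a rough sketch of option 1, here is the Get policy from your question with an explicit, much shorter ExpiryTimeInSecs (you would lower the 21600 on your Put policy as well). The 300 seconds is purely illustrative, not a recommendation:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<KeyValueMapOperations continueOnError="false" enabled="true" name="KVMop-getSessionID" mapIdentifier="sessionCache">
    <DisplayName>KVMop-getSessionID</DisplayName>
    <!-- Illustrative value only: each pod refreshes its cached copy from storage
         after at most 5 minutes, bounding how long a stale read can persist -->
    <ExpiryTimeInSecs>300</ExpiryTimeInSecs>
    <Get assignTo="sessionID">
        <Key>
            <Parameter ref="jwt.username"/>
        </Key>
    </Get>
</KeyValueMapOperations>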
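
And a minimal sketch of option 2 using the Cache policies. The policy names and the one-hour TTL are assumptions on my part; check the PopulateCache and LookupCache references for the full element list:

<!-- Login response flow: store the fresh sessionId so it expires together with the JWT -->
<PopulateCache async="false" continueOnError="false" enabled="true" name="PC-putSessionID">
    <CacheKey>
        <KeyFragment ref="username"/>
    </CacheKey>
    <Scope>Exclusive</Scope>
    <ExpirySettings>
        <!-- Assumed one-hour JWT lifetime; align this with your token expiry -->
        <TimeoutInSeconds>3600</TimeoutInSeconds>
    </ExpirySettings>
    <Source>code</Source>
</PopulateCache>

<!-- Request PreFlow: read the stored sessionId back for the comparison against the JWT -->
<LookupCache async="false" continueOnError="false" enabled="true" name="LC-getSessionID">
    <CacheKey>
        <KeyFragment ref="jwt.username"/>
    </CacheKey>
    <Scope>Exclusive</Scope>
    <AssignTo>sessionID</AssignTo>
</LookupCache>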

Cheers,


2 REPLIES

Really appreciate your analysis and the solution you provided! Thank you, Paul!