One of the headline new features in OpenAM 13 is support for Stateless Sessions, which allow for essentially unlimited horizontal scalability of your session infrastructure. This is achieved by moving session state out of the data store and placing it directly on the client as a signed and encrypted JWT. Any server in the cluster can then handle any request to validate a session token locally, simply by validating the signature on the JWT and checking that the token has not yet expired (using the expiry timestamp baked into the token itself). Stateless sessions are not in themselves a new concept, and there are a handful of implementations out there. You may be thinking “Great! Where do I sign?”, but stateless sessions have had an Achilles’ heel that has held them back from being truly production-ready: how to handle logout. The general advice is that stateless logout is very hard or impossible. Well, we’re not afraid of a bit of hard work at ForgeRock, so we decided to solve that problem. In this post I’ll tell you how we did it.
Why do we want to log out anyway?
Before we get into the technical details, we should step back and ask why we need to handle logout at all. Does it matter? If you care about security and usability, then the answer should be yes.
The purpose of a session cookie is to prevent a user from having to re-authenticate for every single request they make to a system. Instead, the user authenticates once and we provide them with a secure, time-limited token that proves that they have authenticated. The user then simply presents this token with every request, and the system checks that the token is still valid. As an added bonus, we can also associate session state with that token, but this is not the primary purpose. The drawback of this approach is that if somebody manages to steal the session token then they can act as that user until the token expires.
Assuming that we cannot completely eliminate the possibility of token hijacking (and the history of computer security suggests that we cannot), we should take steps to limit the possible damage. The most obvious way is to limit the time-window in which an attacker can make use of the token:
- Firstly, we should require re-authentication for any operation that would allow the attacker to extend the time-window. For example, changing the user’s password would allow indefinite access to the account, so we require re-authentication for that (OpenAM allows any user profile attributes to be protected in this way).
- Secondly, we can shorten the expiry time on all session tokens, but this is a trade-off of security against user frustration at having to re-authenticate frequently and potentially losing work in progress.
- Finally, we can allow longer session expiry times, but let the user explicitly indicate when they are finished working and invalidate the session at that point. We can also trigger invalidation if the session has been idle for a certain time period. This is the case that explicit logout addresses, and it is the norm for most applications as it provides a more acceptable balance of security and usability.
Stateless Logout
In a stateful session architecture, logout is straightforward: we simply remove the session from internal storage and delete the cookie from the user agent. We can still do the latter in a stateless architecture, but if the cookie has already been stolen then this achieves nothing, as we have no way of telling that the cookie has been stolen.
We could place restrictions on the cookie, such as tying it to a particular IP address, but in a world of mobile clients it is not unusual for the IP address to change legitimately during a session as the client connects to different networks. (OpenAM does support this mode too, but this is primarily for protecting agent sessions).
It seems no matter what we try we end up needing some state on the server to support logout. But how much do we need? In the stateful model we store all active sessions on the server. One alternative would be to instead store all inactive sessions. Initially this may seem like a bad idea: we might expect the number of active sessions to stay roughly constant within some bounds, but surely the number of inactive sessions will grow unbounded over time? We can get around this if we make sure that our session tokens include the expiry time of the token and are tamper-proof (e.g., via a MAC or signature). Then we only need to store those sessions that have been logged out but have not yet expired, which will often be a much smaller set.
This is the approach that OpenAM takes. Logged-out (but not yet expired) tokens are stored in the Core Token Service (CTS). To check the validity of a session, we validate the cookie signature, check the expiry time, and then check the session blacklist to make sure the token hasn’t been logged out.
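To make that flow concrete, here is a minimal sketch of the three checks in Java. The token format (a pipe-separated payload plus an HMAC tag) and all of the class and method names are my own inventions for illustration; the real implementation uses standard JWTs and the CTS rather than an in-memory map:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.time.Instant;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class StatelessSessionValidator {

    private final byte[] hmacKey;
    // Blacklist of logged-out sessions: session ID -> session expiry time.
    private final Map<String, Instant> blacklist = new ConcurrentHashMap<>();

    public StatelessSessionValidator(byte[] hmacKey) {
        this.hmacKey = hmacKey.clone();
    }

    /** Records a logout. Tokens look like "sessionId|expiryEpochSeconds.tag". */
    public void logout(String sessionId, Instant sessionExpiry) {
        blacklist.put(sessionId, sessionExpiry);
    }

    /** Entries can be dropped once the session would have expired anyway. */
    public void pruneExpired() {
        blacklist.values().removeIf(expiry -> expiry.isBefore(Instant.now()));
    }

    public boolean isValid(String token) {
        int dot = token.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String payload = token.substring(0, dot);
        // 1. Verify the signature, so we can trust the claims in the token.
        byte[] expectedTag = hmac(payload);
        byte[] actualTag = Base64.getUrlDecoder().decode(token.substring(dot + 1));
        if (!MessageDigest.isEqual(expectedTag, actualTag)) { // constant-time
            return false;
        }
        // 2. Check the expiry timestamp baked into the token itself.
        // (Safe to parse: the valid signature proves we issued this payload.)
        String[] claims = payload.split("\\|");
        String sessionId = claims[0];
        Instant expiry = Instant.ofEpochSecond(Long.parseLong(claims[1]));
        if (expiry.isBefore(Instant.now())) {
            return false;
        }
        // 3. Check the token has not been blacklisted by an explicit logout.
        return !blacklist.containsKey(sessionId);
    }

    private byte[] hmac(String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(hmacKey, "HmacSHA256"));
            return mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```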
Wait, check the CTS? Doesn’t that create a bottleneck that will limit scalability? Yes, but in practice we can push that limit up to a really high level.
Firstly, the CTS backend, OpenDJ, can sustain very high read rates when clustered. Session blacklists are also monotonic: once a session is blacklisted, it is never un-blacklisted. This means we can use multi-master replication to scale writes and achieve strong eventual consistency. In the worst case there may be a small delay before all servers in the cluster know of a new blacklist entry, but this is usually a very short time window. Even in the case of network partitions, all sides can continue accepting reads and writes (i.e., an AP system in terms of the CAP theorem): after the partition heals we can simply take the union of the blacklists from each side to re-establish consistency without conflicts, and in the meantime we broadcast session logouts as far as we can.
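The reason the union re-establishes consistency with no conflicts is that a grow-only set is a textbook conflict-free replicated data type (a “G-Set” in CRDT terms). Here is a toy illustration with invented names; in reality the merging happens inside OpenDJ’s replication, not in application code:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative grow-only set: because entries are only ever added, replicas
 * that diverge during a partition can always be reconciled by taking the
 * union, with no conflicts to resolve and no entries lost.
 */
public class MonotonicBlacklist {

    private final Set<String> blacklistedIds = ConcurrentHashMap.newKeySet();

    public void add(String sessionId) {
        blacklistedIds.add(sessionId);
    }

    public boolean contains(String sessionId) {
        return blacklistedIds.contains(sessionId);
    }

    /** Merge with another replica after a partition heals: plain set union. */
    public void mergeFrom(MonotonicBlacklist other) {
        blacklistedIds.addAll(other.blacklistedIds);
    }
}
```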
The CP alternative would be for both sides to stop processing requests for the duration of the partition to avoid processing a (stolen) session token that has been logged out on the other side. This would result in total loss of availability for all applications using the session service (i.e., likely an entire organization), which would be unacceptable to most users!
In addition to allowing highly scaled deployments of OpenDJ, the monotonicity of blacklists also allows us to employ aggressive caching strategies on each server to minimise the need to hit OpenDJ at all. As well as a traditional cache, blacklisting is a textbook use case for a probabilistic data structure: the Bloom Filter.
Rolling Bloom Filters
A Bloom Filter is a probabilistic data structure that allows us to represent very large sets of objects in a comparatively small amount of memory — typically a few bits per element. This allows us to store many millions of blacklisted sessions in-memory on each server. Effectively, each server has its own highly-compressed copy of the entire blacklist for the whole deployment, allowing session validity to be checked locally in most cases and regaining horizontal scalability.
The trade-off is that a Bloom Filter may produce false positives. That is, if the filter says that an element is not in the set, then this is definitely the case; in other cases, however, the filter can only say that the element might be in the set, and we then need to check a definitive source to know for sure. In the context of session blacklisting, then, we can store the active session blacklist in a Bloom Filter on each server and use it to check session validity: if the filter says the session is not blacklisted then we can trust that the session is still valid. Otherwise we must query the definitive blacklist in the CTS.
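Here is a quick demonstration of that one-sided error, using Guava’s BloomFilter class (which, as mentioned later, the OpenAM implementations are partially based on):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterDemo {
    public static void main(String[] args) {
        BloomFilter<String> blacklist = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000,   // expected insertions
                0.01);   // desired false-positive probability

        blacklist.put("session-123");

        // Added elements are always reported as (maybe) present:
        System.out.println(blacklist.mightContain("session-123")); // true

        // Unseen elements are usually reported absent, but around 1% of the
        // time the filter will say "maybe" for an element that was never
        // added. A "false" answer, however, is always correct.
        System.out.println(blacklist.mightContain("session-456")); // almost certainly false
    }
}
```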
As we would expect most session requests to be for valid (i.e., non-blacklisted) sessions, the Bloom Filter should be able to validate most sessions without needing to query the CTS. A small proportion of sessions will be flagged as maybe-blacklisted when they are actually still valid, but we can mitigate this in two ways (sketched in code after the list):
- firstly, we can tune the false-positive probability (fpp) of the Bloom Filter to be arbitrarily low by using more memory (an optimally-sized filter needs only about 1.44 × log2(1/fpp) bits per element);
- secondly, as the set of sessions that the BF is unsure about is itself monotonic, we can cache the lookup from the CTS in a traditional in-memory LRU cache to speed up repeated validations of the same token on the same server (an extremely common pattern as tokens are typically validated repeatedly in the course of a single request).
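Putting these pieces together, a sketch of the layered check might look like the following. The CtsClient interface and all of the sizing parameters are invented for illustration, and I have added a short TTL to the cache as one simple way of bounding how long a stale “not blacklisted” answer could mask a subsequent logout:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

public class BlacklistChecker {

    /** Hypothetical stand-in for a lookup against the definitive CTS blacklist. */
    public interface CtsClient {
        boolean isBlacklisted(String sessionId);
    }

    // In-memory view of the entire blacklist: at ~14.4 bits per entry for a
    // 0.1% false-positive rate, 10 million entries fit in roughly 18MB.
    private final BloomFilter<String> filter = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            10_000_000, 0.001);

    // LRU cache of definitive answers for sessions the filter was unsure
    // about. The short TTL bounds how long a cached "not blacklisted" answer
    // can mask a logout that happened after the answer was cached.
    private final Cache<String, Boolean> definitiveAnswers = CacheBuilder
            .newBuilder()
            .maximumSize(100_000)
            .expireAfterWrite(10, TimeUnit.SECONDS)
            .build();

    private final CtsClient cts;

    public BlacklistChecker(CtsClient cts) {
        this.cts = cts;
    }

    /** Called on local logout and on logout broadcasts from other servers. */
    public void onLogout(String sessionId) {
        filter.put(sessionId);
    }

    public boolean isBlacklisted(String sessionId) {
        // The common case: a definite "no", answered entirely in memory.
        if (!filter.mightContain(sessionId)) {
            return false;
        }
        // The filter said "maybe": try the cache, then fall back to the CTS.
        Boolean cached = definitiveAnswers.getIfPresent(sessionId);
        if (cached != null) {
            return cached;
        }
        boolean answer = cts.isBlacklisted(sessionId);
        definitiveAnswers.put(sessionId, answer);
        return answer;
    }
}
```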
The main problem with the Bloom Filter (BF) approach is that it is not possible to remove an element from a BF after it has been added. While there are solutions that allow a BF to grow over time, we would still eventually run out of memory. Counting Bloom Filters do support deletion of elements, but at the cost of using much more memory, and they cannot expand dynamically to accommodate changes in load.
To get around this problem, we designed an apparently novel variant that we call a Rolling Bloom Filter (RBF). An RBF takes advantage of the fact that blacklisted sessions can safely be forgotten once they have expired. We therefore use a chain of Bloom Filters, with each item in the chain representing sessions that are due to expire within some time window. Once that window has passed, we can safely dispose of that entire Bloom Filter and remove it from the chain.
Following the design of the Scalable Bloom Filter (SBF) paper, we size each subsequent “bucket” in the chain according to a geometric series, such that the sum of false-positive probabilities never exceeds our desired limit for the chain as a whole. Unlike the SBF approach, we allocate our Bloom Filters from a geometric series represented as a pool: when a bucket is discarded, its index in the series is returned to the pool to be reused by the next bucket created. A new bucket is only created when the previous bucket in the chain has become saturated, which we determine by counting the proportion of bits set to 1 in the underlying bit-vector.
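To give a flavour of how these pieces fit together, here is a toy version of the rolling chain. All of the names and parameters are invented, and I have used Guava’s expectedFpp() as the saturation test rather than counting set bits as the real implementation does:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.PriorityQueue;

public class RollingBloomFilter {

    private static final double TOTAL_FPP = 0.001; // budget for the whole chain
    private static final double RATIO = 0.5;       // geometric series ratio
    private static final long EXPECTED_PER_BUCKET = 1_000_000;

    /** One link in the chain: a Bloom filter plus bookkeeping. */
    private static final class Bucket {
        final BloomFilter<String> filter;
        final double fppBudget;
        final int poolIndex;
        Instant latestExpiry = Instant.MIN; // once past, the bucket can go

        Bucket(int poolIndex) {
            this.poolIndex = poolIndex;
            // Index i in the pool gets budget TOTAL_FPP * (1-r) * r^i, so the
            // budgets over all indices sum to TOTAL_FPP for the whole chain.
            this.fppBudget = TOTAL_FPP * (1 - RATIO) * Math.pow(RATIO, poolIndex);
            this.filter = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8),
                    EXPECTED_PER_BUCKET, fppBudget);
        }
    }

    private final Deque<Bucket> chain = new ArrayDeque<>();
    private final PriorityQueue<Integer> freeIndices = new PriorityQueue<>();
    private int nextFreshIndex = 0;

    public synchronized void add(String sessionId, Instant sessionExpiry) {
        dropExpiredBuckets();
        Bucket current = chain.peekLast();
        // Saturation check: Guava tracks the filter's current expected
        // false-positive probability as elements are added, which is a
        // convenient approximation of counting the set bits directly.
        if (current == null || current.filter.expectedFpp() >= current.fppBudget) {
            current = new Bucket(takeIndex());
            chain.addLast(current);
        }
        current.filter.put(sessionId);
        if (sessionExpiry.isAfter(current.latestExpiry)) {
            current.latestExpiry = sessionExpiry;
        }
    }

    public synchronized boolean mightContain(String sessionId) {
        return chain.stream().anyMatch(b -> b.filter.mightContain(sessionId));
    }

    /** Discards buckets whose sessions have all expired, recycling indices. */
    private void dropExpiredBuckets() {
        Instant now = Instant.now();
        chain.removeIf(bucket -> {
            if (bucket.latestExpiry.isBefore(now)) {
                freeIndices.add(bucket.poolIndex); // return index to the pool
                return true;
            }
            return false;
        });
    }

    /** Reuse the lowest freed index if any, otherwise mint a fresh one. */
    private int takeIndex() {
        Integer recycled = freeIndices.poll();
        return recycled != null ? recycled : nextFreshIndex++;
    }
}
```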
I have only briefly described the details of this approach, but as we are an open source company you can read the source for the full picture (free Backstage account required). There you will also find high-performance atomic, synchronized, and copy-on-write (with write batching) implementations of normal, scalable, and rolling Bloom Filters (partially based on the Guava BloomFilter).
Stable Session Identifiers
The above all sounds very nice, but it falls down if sessions contain state that may change over the life of the session. Because the session state is encoded into the token itself, the token will change whenever the state does, and a blacklist of token values would only reflect the token as it was at the point of logout. If an attacker has hijacked an earlier version of the token then they would still be able to use it.
For this reason, in OpenAM each stateless session is created with a unique random UUID that serves as a stable identifier for that session. This UUID never changes during the life of the session, and so can be used to blacklist all versions of the session token with a single entry. It is also considerably more compact to store than the full session token. The stable ID is protected by the same signature as the rest of the token, preventing a client from tampering with it.
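In outline it looks something like this (invented names again, and glossing over how the session is actually serialised into a JWT):

```java
import java.util.UUID;

public final class StatelessSession {

    // Stable, random identifier: never changes, however the state changes.
    private final String stableId = UUID.randomUUID().toString();
    private String state; // mutable; a fresh token is issued when it changes

    public String stableId() {
        return stableId;
    }

    public void setState(String state) {
        this.state = state; // the token bytes change, the stableId does not
    }

    /** Hypothetical stand-in for the token signing machinery. */
    public interface TokenSigner {
        String sign(String payload);
    }

    public String toSignedToken(TokenSigner signer) {
        // The stable ID is signed along with the rest of the session state,
        // so a client cannot swap in a different ID without detection.
        return signer.sign(stableId + "|" + state);
    }
}
```

On logout, it is this stable ID, not the current token bytes, that goes into the blacklist and hence into the Bloom Filters.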
When isn’t stateless appropriate?
Given the above discussion, you may be wondering why we still offer a stateful session implementation in OpenAM. One reason might be maturity: while we are pretty confident in the quality of our engineering, and our QA team have done a great job testing the stateless functionality, this is the first production release of the technology and some may prefer to wait until it has stood the test of time. Beyond that, there are some architectural reasons why stateless may not be a good fit:
- If you have regulatory or legal requirements that certain session-related data does not leave your organisation’s infrastructure, then a stateful approach may be more appropriate.
- If you store a very large amount of data in sessions, then a stateless JWT may well exceed the maximum cookie size (around 4KB in most browsers). This is not a recommended practice, but in some situations it may be unavoidable.
- If you care about squeezing every last millisecond out of raw response-time latency. Decoding and verifying a JWT adds some unavoidable cryptographic overhead to each request. While we do our best to minimise this, and get pretty close to stateful performance, it is still somewhat slower per request.
- If you want to be able to list and invalidate sessions from the admin console. This is a useful feature of stateful sessions that is essentially impossible to reproduce in a stateless approach. We have some ideas for limited functionality in this area, but it will never match the capabilities of stateful sessions, so if administrator session management is essential, stateless is not for you.
For more information on configuring stateless sessions, see the OpenAM 13 Administration Guide.
If you are looking for advice on deploying stateless sessions with OpenAM, or other general IAM consultancy, I am now available as an independent contractor: see https://pando.software/ for details.