Datalog is a logic programming language, based on Prolog, which is seeing something of a resurgence in interest in recent years. In particular, several recent approaches to authorization (working out who can do what) have used Datalog as the logical basis for access control decisions. On the face of it, this seems like a perfect fit, and there’s a lot to recommend it. I myself have been a fan of Datalog since first coming across it at the start of my PhD studies back in 2003, and have even written papers advocating for it. However, although I think it has a lot of benefits, I think there is some confusion about some of its complexity results that means it is not always as good a fit as you may be led to believe.
What exactly is Datalog?
Datalog is a cleaned up version of Prolog, created by the database community as an attempt to marry logic programming techniques with relational databases to create deductive databases. I have a great deal of respect for the database community, and when they take an interest in a topic it’s almost always worth paying attention to the solutions they come up with. (See also injection attacks on SQL, which they neatly solved with parameterized queries — something web browsers have still failed to solve convincingly). In this case, Datalog improved Prolog in several ways to make it useful as a database query language:
- Removing all the non-logical operators, such as the dreaded “cut” and I/O “predicates”, which destroy the logical purity of Prolog.
- Making the language not Turing-complete, thereby guaranteeing that all queries will eventually terminate.
- Discarding Prolog’s top-to-bottom backtracking execution strategy, where the order of rules in a program (or conditions in a rule) can drastically alter the correctness, performance or even termination of a query. In Datalog, the meaning and efficiency of a program is the same no matter what order the rules appear in.
Datalog allows you to write nice logical rules, such as the following classic definition of an ancestor as being the transitive closure of a parent relation:
ancestor_of(Ancestor, Descendant) if parent_of(Ancestor, Descendant).
ancestor_of(Ancestor, Descendant) if
parent_of(Ancestor, ChildOfAncestor) and
If you’re not familiar with Prolog syntax, then this says that a person is an ancestor of someone if they are their parent (the first rule), or if they are the parent of someone else who is an ancestor of them. This is a pretty basic (and dull) example, and not many people have a need for ancestral access control, but it hopefully gives you an idea of the power of Datalog for succinctly expressing relationships between objects.
Why do people like it for authorization?
Declarative rule-based approaches to authorization are attractive because they allow you to neatly separate such rules from the business logic of your application (or that’s what people hope, at least). Datalog is a particularly nice approach to this, because the rules are very expressive and concise, and also because it provides some useful features. For example, suppose we defined a set of rules to determine if a particular user should have access to a particular resource:
allow(Operation, Resource, User) if …
% Example - allow a user to update any resource they own:
allow(update, Resource, User) if
We can call this to check if a particular request should be allowed —
allow(update, “/users/alice”, bob)? — but we can also ask more general queries, such as: which user’s can update Alice’s profile:
allow(update, “/users/alice”, User)? Here
User with a capital letter is a variable and the interpreter will tell us all possible instantiations of that variable (all users) for which the query is true: i.e., who can update that resource. Likewise, we can query the same rules to see all resources that Bob can access, and all operations he can perform on them. These are incredibly useful features to have in an authorization system when you want to review who has access to what.
One of the most misunderstood parts of Datalog refers to its time complexity: how long it takes to answer a query as the “size” of that query grows. You will often see statements that Datalog queries are guaranteed to terminate in “polynomial time” (as opposed to exponential time). This is true, but only when taken in a very narrow sense.
First of all, we have to ask “polynomial in what variable?” When we talk about Datalog having polynomial time complexity, we are referring to its data complexity. That is, if we fix a given program (set of rules), how does the time needed to answer a query change as the size of the database of facts increases? In SQL terms, consider the program to be our SQL query (or views) and we’re asking how much longer that query takes to run as our tables get bigger. This is the primary complexity measure that most people care about when talking about databases, and indeed in this case Datalog has polynomial time complexity.
But is this the appropriate measure of complexity for authorization? In this case I would argue that the situation is almost entirely reversed. In a typical authorization scenario, the data is largely of fixed size: the specific resource being accessed is known and the user has already been identified by authentication, so we don’t need to look at a whole tables of data. On the other hand, there can be very large numbers of authorization rules that have to be considered to make that decision. So the size of data is fixed, but the number of rules is variable. In this case, the appropriate complexity measure is known as program complexity and the bad news is that Datalog has exponential program complexity!
The other factor we need to consider is that the polynomial data complexity of Datalog is based on the most basic form, which doesn’t have function symbols (ie datatypes like lists) or arithmetic. Once you start adding such features (which almost all real-world Datalogs do), the data complexity inevitably goes up.
So what else could you use?
The exponential program complexity of Datalog is well-known, and it has not escaped academic treatments of Datalog for authorization. An influential paper in this regard is “Datalog with Constraints: A Foundation for Trust Management Languages”, which has been referenced by the developers of Biscuit. This paper shows how to build a powerful authorization logic by extending Datalog. But the actual authorization system they build in that paper is based on a system called RT, which is much closer to a description logic than Datalog. They then translate this more restricted language (which has polynomial program complexity) into their Datalog variant, which is something some description logic reasoners also do.
Description logics share some of the good properties of Datalog, and there has been a lot of research into description logic variants with a range of time complexities. So that is certainly one alternative formalism to look at if you are interested in logical approaches to authorization.
On the other extreme, we are now seeing approaches like Google Zanzibar, which apparently scales to trillions of ACL entries and uses a very minimal language to define relationships between roles (such as that all editors of a document are also viewers). Again, to my eyes this looks like a (very restricted) description logic to me, albeit one with a very large-scale implementation!
On the other hand, this same reasoning that description logics (can) have better worst-case time complexity than Datalog led me to base my PhD thesis on them, which was a decision I frequently regretted. The trade-off for those tighter complexity bounds is a much less expressive language, and I frequently ran into concepts that were hard or impossible to express in description logic but would’ve been simple in Datalog. (There are much more expressive description logics too, but I still find them hard to work with).
So is Datalog a good language for authorization?
Ultimately, I think logic-based approaches to defining authorization policies are an excellent idea. As an example of that approach, Datalog is still an excellent choice. But I wouldn’t say it is uniquely effective in this area. Some of the extensions applied to Datalog in practice push it more towards a full-blown Prolog, and I wouldn’t be surprised if some of these are accidentally Turing-complete. This is not a bad thing: pure Prolog (without the cruft) is also an excellent language, and if you’re not accepting user-defined policies then you can avoid pathological performance cases.
On the other hand, if you want to scale to very large numbers of entities with only very simple rules then some of the lighter-weight description logic dialects are, IMO, an excellent fit. Just be aware of the expressiveness trade-off. There’s an enormous tree of logics out there, no need to get hung up on just one branch.