Parse, don’t type-check

There’s a fantastic article from last year titled Parse, don’t validate. I’d highly recommend it to any programmer (along with the more recent follow up Names are not type safety). The basic idea is that there are two ways to check that some input to a function is valid:

  • A validator checks that the input is valid and throws an error if not. It doesn’t return anything. For example, checking that a list is not empty.
  • A parser does the same as a validator, but returns a more specific representation of the input that ensures that the required property is satisfied. For example, checking that a list is not empty and returning a NonEmptyList type.

The thesis of the article is that parsers are preferable to validators. If you’ve not read the original article, please do so – it’s very well written and makes the case much better than I can summarise it. The essential message is to make illegal states unrepresentable. In the article, this is done by making use of the type system. This is a philosophy I entirely agree with, but I want to point out and expand upon one ironic aspect of the argument:

A type checker is a paradigmatic example of a validator!

A type checker, after all, takes as input an already parsed program representation and rejects it if it fails to type-check. It doesn’t return a more specific representation of the program. (Not to be confused with type inference, which does return more information, but only about the types).

So what would be the parser alternative to a type checker in a programming language? One approach would be to actually create a more specific grammar that eliminates illegal states. For example, if a given function requires a number that must be between one and five, then rather than having a generic syntax for function calls such as:

fun_call ::= fun_name ‘(‘ integer ‘)’

You would instead define specific grammar rules for the functions you provide:

one_to_five ::= 1 | 2 | 3 | 4 | 5
fun_call ::= ‘lil_fac(‘ one_to_five ‘)’
| ... other function definitions ...

Creating such specific grammars is one approach to crafting domain-specific languages (DSLs). And indeed, DSLs are a great way of ensuring that illegal states aren’t representable. Of course, this is not a scalable solution if you want to make a general-purpose language with user-defined functions.

An alternative to having a very specific grammar is to instead raise the level of abstraction, so that it is easier to avoid invalid states. For example, a common source of programming errors is out of bounds indexing into arrays. This occurs because the programming language only offers a primitive indexing operation: a[x]. Here “x” is an integer but might be out of bounds, resulting in an exception or crash (if you’re lucky). One way we could prevent this would be to define a more specific type of “integers between zero and 12” so that the type system rejects any potentially invalid indexing operations, and then track this more precise type for each array — validation again.

A different approach would be to notice that there are typically a few common ways that you want to use an array and provide specific abstractions in the language that capture those uses. For example, it’s very common to want to iterate through the array performing some kind of calculation. Rather than making everyone manually write out the for-loops for doing this — do arrays in this language start at 0 or 1? Do they stop at array.length or array.length – 1? Do array indexes have a specific type? — you could instead offer a general fold (reduce) operation. Likewise, instead of making people write their own hashtables you could provide one built-in to the language itself. By providing better abstractions you make it less likely that programmers will hit illegal states.

You can then go further and remove the more primitive operations, only allowing access to the higher level abstractions. Hopefully in 2020 most programmers would agree that removing goto statements in favour of higher level structured programming abstractions was a win, and the same can be true of other low-level constructs: nulls, raw memory pointers, etc.

In the world of computer security, there is a direct analogue of this discussion. The security model of most computer systems allows a separation between operations I can (try to) perform and operations I am allowed to perform: I can try and delete your website, but (hopefully) the request will be rejected as unauthorized. In contrast, in object-capability systems my ability to even invoke such an operation depends upon me holding an unforgeable capability granting me permission to perform it. It’s not possible to even attempt an operation that I am not allowed to perform. For example, in a REST API using capability URIs, I cannot simply send a DELETE request to /users/alice, but instead need to send it to some unguessable random URI – if I don’t already have that URI then I can’t even begin to send the request. Object-capability security therefore aims to make unauthorized states unrepresentable.

Perhaps the most widely known embodiment of the object-capability paradigm is the E programming language: a dynamically-typed object-oriented programming language. Although now somewhat abandoned, it is a fascinating language with a lot of great ideas. E uses strong abstraction boundaries to ensure security. Be warned: the E website is a rabbit hole of incredible depth!

In my twenties, I used to do a lot of programming in Tcl. A language that most would agree is about as far removed from a modern statically typed programming language as you can get. (Often pointed out with something close to utter disgust). And yet the idea of making illegal states unrepresentable was entirely natural to myself and other Tcl programmers. I would frequently start a programming project by creating a blank-slate interpreter, stripping out all the built-in language constructs (loops, procedures, even arithmetic), and then adding back in a carefully selected set of high-level domain-specific primitives. This DSL would then become the application’s configuration file, safe in the knowledge that illegal configurations cannot be expressed.

To summarise, my point is not that type systems are bad or that the original essay is flawed. On the contrary, I thank the flying spaghetti monster almost every day for the type system in my day job (and that is only Java’s tepid concoction), and I think the “parse, don’t validate” essay is excellent. But the last two decades have seen such advances in type systems and the adoption of typeful programming patterns that we are in danger of thinking that type systems are the only way of achieving correctness in software construction. They are a tremendously powerful tool, but more basic techniques of abstraction and information hiding can be just as powerful. The goal of making illegal states unrepresentable should be one of the defining goals of software engineering, but there are many valid ways to crack that nut.

Shameless plug: I discuss capability URIs in depth in chapter 9 of my book on API security, just published by Manning. I also discuss using types to enforce security properties in chapter 6.

Author: Neil Madden

Security Director at ForgeRock. Experienced software engineer with a PhD in computer science. Interested in application security, applied cryptography, logic programming and intelligent agents.

One thought on “Parse, don’t type-check”

Comments are closed.