Why you really can parse HTML (and anything else) with regular expressions

There is a common story that has played out many times in the history of computer science. A software engineer learns regular expressions for the first time, feels like a god, and then immediately over-applies them to situations they are not great at, like parsing HTML. What then usually happens is that a more experienced developer patiently explains why HTML cannot be parsed with regular expressions, because HTML is (at least) a context-free language, and you should use a proper HTML parser instead. In this post, I’ll explain why the advice is right, but the reasoning is wrong.

Note: in this post, I am talking about regular expressions as defined in theory, not the heavily extended regexes of Perl etc.

Automata and languages

The justification for this view is given by the Chomsky Hierarchy, which relates classes of formal languages to the classes of automata that can parse them. On the bottom rung we have the regular languages, which can be parsed by simple finite-state automata (also known as finite-state machines, FSMs). Above that we have the context-free languages, which require a (non-deterministic) pushdown automaton (essentially a finite-state machine with a stack data structure). And above that we have the context-sensitive and recursively-enumerable languages, which both require some form of Turing machine (essentially an FSM plus random-access memory).

The Chomsky Hierarchy
    Language class                      Automata needed to recognise
    --------------------------------    ------------------------------
    Regular languages                   Finite-state machines
    Context-free languages              Pushdown automata
    Context-sensitive languages         Linear-bounded Turing machines
    Recursively-enumerable languages    Turing machines

Automata are abstract

The above automata classes are defined not in terms of actual physical machines, but rather as abstract descriptions of classes of machines, devoid of any physical constraints at all. For example, a finite-state machine is usually defined in terms of an abstract state transition function, 𝛿, which takes as input the current state (abstractly specified) and an input symbol, and produces as output the next state:

newState := 𝛿(oldState, inputSymbol)

There are obviously many ways in which such a state-transition function may be implemented. The most obvious implementation is to use a lookup table, indexed by state (as an integer) and input symbol. If only a small number of state transitions are valid, then we may want to use a sparse array representation or a hashtable. If we have a very large number of states and transitions then we could store the transition function in a database table, making use of a variety of sophisticated data structures, such as B-tree indexes. The point is that nothing in the abstract definition of a finite-state machine dictates how we should implement such a machine.
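To make this concrete, here is a minimal sketch in Python of a finite-state machine whose transition function is nothing more than a dictionary lookup. The machine, its states and its alphabet are invented purely for illustration:

    # A sketch of an FSM that accepts binary strings containing an even number
    # of 1s. The state names and alphabet are made up for this example.
    TRANSITIONS = {
        ("even", "0"): "even",
        ("even", "1"): "odd",
        ("odd", "0"): "odd",
        ("odd", "1"): "even",
    }

    def accepts(input_string, start="even", accepting=("even",)):
        state = start
        for symbol in input_string:
            # newState := 𝛿(oldState, inputSymbol), implemented as a table lookup
            state = TRANSITIONS[(state, symbol)]
        return state in accepting

    assert accepts("0110")        # two 1s: accepted
    assert not accepts("10110")   # three 1s: rejected

Swapping the dictionary for a database table or a compressed representation would change nothing about what the machine recognises.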

We could even use a stack to implement a finite-state machine.

In fact, almost all real implementations of an FSM will involve a stack at some point, in the form of the call stack of the language they are implemented in. Some back-tracking regular expression implementations use a stack even more explicitly. Does this suddenly make the implementation a pushdown automaton? Of course not.

The real difference between automata classes

If the presence of a stack is not itself enough to turn an FSM into a pushdown automaton, what is? The essential difference is that the stack in a pushdown automaton may grow without bound: there is no limit to the number of items that can be pushed onto it. Any stack used in a real implementation will, of course, be bounded by available memory.

The same applies to Turing machines. We could implement our FSM using random-access memory (indeed, it is hard not to!), but the crucial difference is that the tape of a Turing machine is again unbounded. A linear-bounded Turing machine is restricted to a tape whose length is a linear function (a multiple) of the size of its initial input, but as that input may itself be arbitrarily large, in practice even a linear-bounded Turing machine is not physically realisable.

In the real world, there is some finite limit to the amount of stuff we could ever provide to any machine. We just don’t have access to enough universe. All of this memory needs to be represented somehow, and even with clever compression techniques, an unbounded amount of memory will eventually require infinite resources (otherwise we must have applied some upper bound).

No pushdown automaton or Turing machine (that is not also an FSM) can be represented in a finite amount of memory.

The real-world restriction on the amount of memory we can provide a machine with also impacts how we can represent the input to that machine. We only have a finite amount of space to store that input, and so the input string must itself be finite. A language whose inputs are all restricted to be finite is known as a finite language, and there is an important result here:

All finite languages are regular.

(The reason is simple: a finite language contains only finitely many strings, so we can recognise it by listing every one of those strings as an alternative, and a finite alternation of fixed strings is a regular expression.)

“Aha!”, you might say, “but I can set up a machine to generate the input on the fly!” This is true, but if your input-generating machine wants to generate a language that is not regular (or random), then it too will need unbounded memory. There is no escaping the pushdown automaton memory tax.

Regular expressions

We can illustrate these ideas by looking at regular expressions. The common wisdom is that regular expressions cannot be used to parse context-free languages. As we have seen, that is true but irrelevant to any languages that we encounter in the real world, which are always finite and therefore regular. Let’s see some examples.

A classic example of a context-free language is that of a number of a symbols followed by an equal number of b symbols. That is, aⁿbⁿ. If we restrict this to some finite upper bound on n, then we can easily write it as a regular expression. For example, for n up to 3 we have:

|ab|aabb|aaabbb

That is, an empty string or a single a followed by a single b, or two as followed by two bs, and so on.
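Such an expression can also be generated mechanically for any chosen bound. The following sketch uses Python’s re module; the helper function is made up for illustration:

    import re

    def a_n_b_n_regex(max_n):
        # Build the alternation ""|"ab"|"aabb"|... up to the chosen bound.
        alternatives = ["a" * n + "b" * n for n in range(max_n + 1)]
        return re.compile("^(?:" + "|".join(alternatives) + ")$")

    pattern = a_n_b_n_regex(3)
    assert pattern.match("aabb")          # n = 2: accepted
    assert not pattern.match("aaabb")     # unequal numbers of as and bs
    assert not pattern.match("aaaabbbb")  # n = 4 exceeds the chosen bound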

We can do the same for the trick of matching balanced parentheses (or HTML tags):

[^()]*|[^()]*\([^()]*\)[^()]*|[^()]*\([^()]*\([^()]*\)[^()]*\)[^()]*|...

That is either:

  • A string containing no parentheses, or
  • A string containing no parentheses, followed by a single open parenthesis, followed by another string without any parens, followed by a single closed paren
  • And so on…

We have to handle all valid interleavings of open and close parentheses, but it is doable up to some finite maximum nesting depth.
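As a sketch, such a pattern can be generated for any chosen maximum depth. The helper below is invented for illustration, and the pattern it builds also permits several groups side by side at each level of nesting:

    import re

    def balanced_parens_regex(max_depth):
        # Depth 0: any string with no parentheses at all.
        pattern = r"[^()]*"
        for _ in range(max_depth):
            # Each extra level allows any number of groups nested one level deeper.
            pattern = r"[^()]*(?:\(" + pattern + r"\)[^()]*)*"
        return re.compile("^" + pattern + "$")

    paren_re = balanced_parens_regex(2)
    assert paren_re.match("a(b)(c(d))e")    # nesting depth 2: accepted
    assert not paren_re.match("((x)")       # unbalanced: rejected
    assert not paren_re.match("(((x)))")    # deeper than the chosen bound: rejected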

HTML in the real world

Real-world web browsers do in fact impose limits on both the maximum size of HTML documents and the maximum nesting depth of elements within those documents. These helpful replies on Hacker News indicate that the maximum nesting depth browsers will tolerate (as of 2015) is around 512 elements or fewer. So HTML, as it actually exists in the real world, is a finite and therefore regular language.

You can parse (real-world) HTML with regular expressions.
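To make that concrete, the same unrolling trick handles a toy fragment of HTML with a single tag name. This is only a sketch: the helper is invented for illustration, and real HTML needs vastly more cases (multiple tag names, attributes, void elements, comments, and so on):

    import re

    def nested_tag_regex(tag, max_depth):
        # Innermost level: text containing no markup at all.
        pattern = "[^<]*"
        for _ in range(max_depth):
            # Each extra level allows any number of elements nested one level deeper.
            pattern = f"[^<]*(?:<{tag}>{pattern}</{tag}>[^<]*)*"
        return re.compile("^" + pattern + "$")

    em_re = nested_tag_regex("em", 2)
    assert em_re.match("a<em>b<em>c</em></em>d")              # depth 2: accepted
    assert not em_re.match("<em>oops")                        # unclosed tag: rejected
    assert not em_re.match("<em><em><em>x</em></em></em>")    # deeper than the bound: rejected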

Of course, as we’ve hinted at, that statement should be taken with a large grain of salt:

  • Regular expressions are a terrible notation for describing anything other than extremely simple grammars.
  • The implementation of most regular expression engines is not optimal for the sorts of extremely large state spaces that result from applying a finite limit to a more complex grammar.

That is, even if your grammar is regular, you probably want to use some form of BNF to describe it, and an implementation that uses stacks and other data structures to avoid explicitly enumerating every state in memory. These implementation tricks can be seen as just compressed representations of the state transition function.
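As a sketch of that idea, a bounded depth counter can stand in for hundreds of explicit states without ever enumerating them. The function, and the bound of 512 echoing the browser limit mentioned above, are invented for illustration:

    def balanced_up_to(text, max_depth=512):
        # The bounded counter is a compact encoding of max_depth + 1 distinct
        # FSM states; we never write them out explicitly.
        depth = 0
        for ch in text:
            if ch == "(":
                depth += 1
                if depth > max_depth:
                    return False  # deeper nesting than we choose to support
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return False  # a close paren with no matching open paren
        return depth == 0

    assert balanced_up_to("a(b(c))d")
    assert not balanced_up_to("((x)")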

So in reality, as I said at the start, the advice is still the same, but the justification is different. The reason you should use a real HTML parser to parse HTML rather than a regular expression isn’t because regexes are theoretically not up to the job (they are), but rather merely that regular expressions as a notation are crap at describing anything other than simple tokenization problems (for which they are rather good).

The reason you should use a stack (or other data structure) for parsing a language is not because the language is context-free (in reality, it isn’t), but rather just that stacks are a useful data structure in parsing.

Philosophical implications

I am not the first person to point out that formal automata are abstract and bear little relationship to the physical world. The computer scientist and philosopher Brian Cantwell Smith (author of the absolute classic Reflection and Semantics in Lisp) has been investigating the same foundational issues from a more theoretical viewpoint. Of his long-forthcoming multi-volume work, Age of Significance, only a tantalising introduction has so far been published, but it already goes into great detail on the philosophical implications of the abstract definition of automata theory, detached from physical reality. As computational metaphors are applied to a range of scientific disciplines, such as biology and cognitive science, are we sure we even have a firm grasp of what a computer is? It’s not clear to me that we do.

Fundamentally, as I have already said, it is widely agreed that the theory of effective computability focuses on “what can be done by a mechanism.” But three conceptual issues have clouded its proper appreciation—even its formulation.

First, as suggested above, and in spite of its use of such words as ‘effectiveness’ and ‘mechanism,’ the theory is almost always characterised abstractly, as if it were a branch of mathematics. While per se this may not necessarily be a problem, the approach leaves open to question the relation between the allegedly abstract notion of efficacy and very real constraints imposed by physical law, physical reality, hanging in the air, without explication or theoretical legitimacy. It is not enough to assume that God made two independent creative acts; their independence should be defended, if not explained.

Brian Cantwell Smith, Age of Significance

Thanks to Stuart Reeves for pointing me at BCS’s work. It’s an absolute corker, which I highly recommend reading. Hopefully the full volume(s) will eventually appear.

Update 1: clarified statement on infinite memory requirements of PDA/TMs. Thanks to Mike Samuel for pointing out the sloppy wording in the first version.

Author: Neil Madden

Founder of Illuminated Security, providing application security and cryptography training courses. Previously Security Architect at ForgeRock. Experienced software engineer with a PhD in computer science. Interested in application security, applied cryptography, logic programming and intelligent agents.
