Why should Regex not be used to handle HTML?


I understand that if I try to use regex on HTML, the HTML tags will ooze out of my eyes like liquid pain, among other horrors, and that I should use an XML parser or something similar instead.

My inner curious child keeps asking me: but why? Why, dear Stack Overflow, can't I use regular expressions to extract fields from markup languages?



HTML is simply more complex than regex, to the point that no regular expression can handle HTML satisfactorily.


The most direct explanation comes from formal language theory, via the Chomsky hierarchy, which organizes languages by complexity, that is, by how unrestricted their grammar rules are and by what kind of machine is needed to recognize them.

The hierarchy runs from Type 0, the most complex, which covers every language recognizable by a Turing machine, down to Type 3, the least complex, recognizable by a simple finite automaton.

  • A regex, or regular expression, is an implementation of a regular grammar, Type 3 of the Chomsky hierarchy. It is a linear grammar, easily recognized by a finite automaton.

  • HTML, on the other hand, derives from SGML, which is a context-free language (CFL), generated by a context-free grammar (CFG), of Type 2, recognized by a pushdown automaton (an automaton with a stack). HTML is not a regular language, and context-free languages in general need not be, even though HTML in combination with CSS3 has been shown to be Turing complete via Rule 110.
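The difference between the two machines can be made concrete with a toy language of nested `<div>` tags. The sketch below is only an illustration, not a real HTML recognizer: `DEPTH_2` and `balanced_with_stack` are hypothetical names, and the input is assumed to contain nothing but `<div>` and `</div>` tokens. A pattern written for a fixed nesting depth stops working one level deeper, while a recognizer with a stack handles any depth.

```python
import re

# Hypothetical pattern: recognizes <div> nesting, but only up to depth 2.
# Supporting depth 3 means writing a new, longer pattern, and so on forever.
DEPTH_2 = re.compile(r"(?:<div>(?:<div></div>)*</div>)*")


def balanced_with_stack(text: str) -> bool:
    """Recognize balanced <div>...</div> nesting of any depth with a stack."""
    stack = []
    # Splitting the input into tokens is a regular task, so a regex is fine here.
    for token in re.findall(r"</?div>", text):
        if token == "<div>":
            stack.append(token)   # remember the open tag
        elif not stack:
            return False          # closing tag with nothing open
        else:
            stack.pop()           # closing tag matches the latest open tag
    return not stack              # every open tag must have been closed


depth_2 = "<div><div></div></div>"
depth_3 = "<div><div><div></div></div></div>"

print(DEPTH_2.fullmatch(depth_2) is not None)  # fixed-depth pattern copes
print(DEPTH_2.fullmatch(depth_3) is not None)  # one level deeper, it does not
print(balanced_with_stack(depth_3))            # the stack does not care
```

The stack is exactly the extra memory a pushdown automaton has and a finite automaton lacks.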

That is, a finite automaton is not enough to recognize HTML; a pushdown (stack) automaton is needed to recognize SGML, from which HTML derives. Thus, by definition, a regular expression cannot recognize HTML satisfactorily (that is, in a way that covers all cases).
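In practice, the failure tends to show up as soon as tags nest. The following is a minimal sketch using only Python's standard library; `SAMPLE`, `naive_div_content`, `OuterDivText`, and `parsed_div_text` are names made up for this illustration, and the parser subclass handles only the simple case shown here.

```python
import re
from html.parser import HTMLParser

SAMPLE = "<div>outer <div>inner</div> tail</div>"


def naive_div_content(html: str) -> str:
    """Regex attempt: the non-greedy match stops at the first </div> it sees."""
    match = re.search(r"<div>(.*?)</div>", html)
    return match.group(1) if match else ""


class OuterDivText(HTMLParser):
    """Collect all text found inside <div> elements, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def parsed_div_text(html: str) -> str:
    parser = OuterDivText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(repr(naive_div_content(SAMPLE)))  # 'outer <div>inner'
print(repr(parsed_div_text(SAMPLE)))    # 'outer inner tail'
```

The regex cuts the outer element off at the inner closing tag, while the parser, which keeps track of depth, returns the complete text.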


  1. The types in the Chomsky hierarchy are not mutually exclusive; each one is a proper subset of the next more general type: Type 3 ⊊ Type 2 ⊊ Type 1 ⊊ Type 0.
