I understand that if I try to use regex on HTML, the HTML tags will ooze out of my eyes like liquid pain, among other horrors, and that I should use an XML parser or something similar.
My curious inner child keeps asking me: but why? Why, dear Stack Overflow, can't I use regular expressions to mine fields in markup languages?
HTML is simply more complex than regex, to the point that no regular expression can handle HTML satisfactorily.
The most direct explanation comes from formal language theory, via the Chomsky hierarchy, which organizes languages by complexity, that is, by how much freedom their grammar rules have and by the kind of machine needed to recognize them.
It runs from Type 0, the most complex, which covers every language recognizable by a Turing machine, down to Type 3, the least complex, recognizable by a simple finite automaton.
A regex (regular expression) is an implementation of a regular grammar, Type 3 in the Chomsky hierarchy: a linear grammar, easily recognized by a finite automaton.
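To make "recognized by a finite automaton" concrete, here is a minimal sketch in Python. It assumes a toy regular language, a single flat opening tag with a lowercase name (`<[a-z]+>`), and the function name `accepts` and the state names are only illustrative. The hand-written automaton keeps nothing but its current state, and it agrees with the equivalent regular expression on every sample:

```python
import re

# Hand-written finite automaton for the toy regular language <[a-z]+>
# (one flat opening tag with a lowercase name; an illustrative choice).
def accepts(text: str) -> bool:
    state = "start"
    for ch in text:
        if state == "start" and ch == "<":
            state = "open"
        elif state in ("open", "name") and "a" <= ch <= "z":
            state = "name"
        elif state == "name" and ch == ">":
            state = "done"
        else:
            return False  # no transition defined: reject
    return state == "done"

# The same language written as a regular expression.
TAG = re.compile(r"<[a-z]+>")

# The automaton and the regex agree on accepted and rejected inputs.
for sample in ["<div>", "<>", "<div", "<div>>", "div"]:
    assert accepts(sample) == bool(TAG.fullmatch(sample))
```

The automaton has a fixed, finite set of states and no other memory, which is exactly the limitation that matters once tags can nest.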
HTML, on the other hand, derives from SGML, which is a context-free language (CFL), generated by a context-free grammar (CFG), of Type 2, recognized by a pushdown automaton (an automaton with a stack). And neither HTML nor context-free languages in general are regular, even though HTML is reportedly Turing complete via Rule 110 when combined with CSS3.
That is, a finite automaton is not enough to recognize HTML; a pushdown automaton is needed to recognize SGML, from which HTML derives. Thus, by definition, it is not possible to recognize HTML with a regular expression satisfactorily (that is, in a way that covers all cases).
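A small Python sketch of the difference, under heavy simplifying assumptions: tags have no attributes, and there are no void elements, comments, or malformed markup. The names `DIV`, `TAG`, and `is_balanced` are hypothetical. A non-greedy regex stops at the first closing tag it sees, while a checker with an explicit stack (the extra memory a pushdown automaton has) tracks the nesting correctly:

```python
import re

# Non-greedy regex that tries to capture the content of one <div> element.
DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)

# Tokenizer for simplified tags; splitting out flat tags is a regular task.
TAG = re.compile(r"<(/?)([a-z]+)>")

def is_balanced(html: str) -> bool:
    """Check tag nesting with an explicit stack, as a pushdown automaton would."""
    stack = []
    for closing, name in TAG.findall(html):
        if not closing:
            stack.append(name)          # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                # closing tag without a matching opener
    return not stack                    # every opened tag must be closed

nested = "<div><div>inner</div>tail</div>"

# The regex pairs the outer <div> with the inner </div>, losing "tail".
regex_capture = DIV.search(nested).group(1)
assert regex_capture == "<div>inner"

# The stack-based checker handles arbitrary nesting depth.
assert is_balanced(nested)
assert not is_balanced("<div><p></div></p>")
```

For real documents a proper HTML parser is still the right tool; this only illustrates why the stack is needed.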
- The grammar types in the Chomsky hierarchy are not mutually exclusive but nested as proper subsets: Type 3 ⊊ Type 2 ⊊ Type 1 ⊊ Type 0.