Why should Regex not be used to handle HTML?

Question:

I understand that if I try to use Regex on HTML, the HTML tags will ooze out of my eyes like liquid pain, among other horrors, and that I should use an XML parser or something instead.

My inner curious child keeps asking: but why? Why, dear Stack Overflow, can't I use regular expressions to extract fields from markup languages?

Answer:

TL;DR

HTML is simply more complex than Regex, to the point that no Regular Expression can handle HTML satisfactorily.

Explanation

The most direct explanation comes from formal language theory, via the Chomsky Hierarchy, which organizes languages according to their complexity, that is, how unrestricted their grammar rules are and what kind of machine is needed to recognize them.

It ranges from Type 0, the most complex, which covers every language recognizable by a Turing machine, down to Type 3, the least complex, recognizable by a simple finite automaton.

  • Regex, or Regular Expression, is the implementation of a Regular Grammar, Type 3 of the Chomsky Hierarchy: a linear grammar, easily recognized by a finite automaton.

  • HTML, on the other hand, derives from SGML, which is a Context-Free Language (CFL), generated by a Context-Free Grammar (CFG), of Type 2, recognized by a pushdown automaton (an automaton with a stack). Neither HTML nor a context-free language is Turing complete by itself, although HTML in combination with CSS3 has been shown to be Turing complete by encoding Rule 110.

That is, a finite automaton is not enough to recognize HTML; a pushdown automaton is needed to recognize SGML, from which HTML derives. Thus, by definition, it is not possible to recognize HTML with a Regular Expression satisfactorily (that is, in a way that covers all cases).
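
As a rough illustration of the point about nesting, the sketch below contrasts two naive regular expressions with a parser that keeps track of depth. It uses Python's standard `re` and `html.parser` modules; the sample strings and the names `OuterDivText` and `outer_div_text` are made up for this example, and the depth counter is only a minimal stand-in for the stack of a pushdown automaton, not a complete HTML extractor.

```python
import re
from html.parser import HTMLParser

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# A lazy pattern stops at the FIRST closing tag, so the outer div is cut short.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)
# A greedy pattern runs to the LAST closing tag, so sibling divs get merged.
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)


class OuterDivText(HTMLParser):
    """Collects the text inside the first top-level <div> (hypothetical helper).

    The depth counter acts as a one-symbol stack: it is what lets the
    parser pair each closing tag with the right opening tag.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and not self.done:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # the first top-level div has closed

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.chunks.append(data)


def outer_div_text(markup):
    parser = OuterDivText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(repr(lazy))                      # 'outer <div>inner'
print(repr(greedy))                    # 'a</div><div>b'
print(repr(outer_div_text(nested)))    # 'outer inner tail'
print(repr(outer_div_text(siblings)))  # 'a'
```

Neither regular expression can be patched to work for every nesting depth, because a finite automaton has no memory of how many tags are still open; the parser gets it right only because it counts.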


Notes:

  1. The types in the Chomsky Hierarchy are not mutually exclusive; each one is a proper subset of the next, more general one: Type 3 ⊊ Type 2 ⊊ Type 1 ⊊ Type 0.