Question:
I need to extract all the URLs from the href attributes of the <a> tags in an HTML page. I tried using regular expressions:
Uri uri = new Uri("http://google.com/search?q=test");
Regex reHref = new Regex(@"<a[^>]+href=""([^""]+)""[^>]+>");
string html = new WebClient().DownloadString(uri);
foreach (Match match in reHref.Matches(html))
    Console.WriteLine(match.Groups[1].ToString());
But there are many potential problems:
- How do I filter only specific links, for example, by CSS class?
- What if the attribute uses different quotes?
- What if there are spaces around the equals sign?
- What if part of the page is commented out?
- What if a chunk of JavaScript gets in the way?
- Etc.
The regular expression very quickly becomes monstrous and unreadable, and more and more problem areas are found.
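For example, even on this tiny made-up fragment my regex both matches a link it should ignore and misses links it should find:

using System;
using System.Text.RegularExpressions;

string html = @"
<a href=""ok"" id=""a1"">fine</a>
<!-- <a href=""commented-out"" id=""a2"">hidden</a> -->
<a href='single-quotes' id=""a3"">missed</a>
<a href = ""spaces-around-equals"" id=""a4"">also missed</a>";

Regex reHref = new Regex(@"<a[^>]+href=""([^""]+)""[^>]+>");
foreach (Match match in reHref.Matches(html))
    Console.WriteLine(match.Groups[1].Value);

// Prints "ok" and "commented-out": the commented-out link is matched,
// both valid-but-differently-written links are skipped, and a bare
// <a href="..."> with no further attributes would not match at all
// because of the trailing [^>]+.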
What to do?
Answer:
TL;DR
Use AngleSharp to parse HTML.
If you need not only to parse HTML but also to run a full-fledged browser, execute all the scripts, click buttons and see what happens, use CefSharp or Selenium. Note that this will be orders of magnitude slower.
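For illustration, a minimal Selenium sketch (assuming the Selenium.WebDriver and chromedriver packages are installed; the exact setup is up to you):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Starts a real browser, lets it execute the page's scripts,
// then queries the live DOM.
using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("http://google.com/search?q=test");
    foreach (IWebElement a in driver.FindElements(By.CssSelector("a")))
        Console.WriteLine(a.GetAttribute("href"));
}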
For the curious
Regular expressions are designed for handling relatively simple texts, the ones defined by regular languages. Regular expressions have grown more complex since their inception, especially in Perl, whose implementation inspired the regex engines of other languages and libraries, but they are still poorly suited (and hardly ever will be suited) for processing complex languages like HTML. Part of the difficulty of processing HTML also lies in the very intricate rules for handling invalid code, inherited from the first implementations of the dawn of the Internet, when there were no standards at all and every browser vendor piled on unique and inimitable features.
So, in general, regular expressions are not the best candidate for HTML processing. It is usually wiser to use specialized HTML parsers.
AngleSharp
License: BSD (3-clause)
A proven player in the parser field. Unlike CsQuery, it is written from scratch, by hand, in C#. It also includes parsers for other languages.
The API is built on top of the official JavaScript HTML DOM specification. Initially it had oddities in places that are unusual for .NET developers (for example, accessing an invalid index in a collection returns null rather than throwing an exception), but the developers eventually relented and fixed the creepiest crutches. Some problems went away on their own, such as the dependency on the Microsoft BCL Portability Pack. Some remain: for example, the namespaces are very granular, so even basic usage of the library requires three using directives, and so on. But overall, nothing critical.
HTML processing is simple:
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

IHtmlDocument angle = new HtmlParser().ParseDocument(html);
foreach (IElement element in angle.QuerySelectorAll("a"))
    Console.WriteLine(element.GetAttribute("href"));
And the code hardly gets any more complicated when more complex logic is needed:
IHtmlDocument angle = new HtmlParser().ParseDocument(html);
foreach (IElement element in angle.QuerySelectorAll("h3.r a"))
    Console.WriteLine(element.GetAttribute("href"));
HtmlAgilityPack
License: Ms-PL
The oldest, and therefore the most popular, .NET parser. Age, however, does not imply quality: for TEN (!!!) YEARS it has been UNABLE (!!!) to fix a CRITICAL (!!!) BUG in the handling of self-closing tags. CodePlex has already died, and a lot of reports like Incorrect parsing of HTML4 optional end tags are still sitting there. A newer incarnation of the bug, Self closing tags modified, is already in its fourth year, and there are a number of similar reports. Some time ago they "fixed" this bug. For one tag. Behind an extra option. And then they broke that option. And I'm not even mentioning the oddities in the API, such as returning null instead of an empty collection when nothing is found.
Elements are selected with XPath rather than CSS selectors. On simple queries the code comes out more or less readable:
HtmlDocument hap = new HtmlDocument();
hap.LoadHtml(html);
// SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection nodes = hap.DocumentNode.SelectNodes("//a");
if (nodes != null)
    foreach (HtmlNode node in nodes)
        Console.WriteLine(node.GetAttributeValue("href", null));
However, when complex queries are needed, XPath that emulates CSS selectors is not particularly readable:
HtmlDocument hap = new HtmlDocument();
hap.LoadHtml(html);
HtmlNodeCollection nodes = hap.DocumentNode.SelectNodes(
    "//h3[contains(concat(' ', @class, ' '), ' r ')]/a");
if (nodes != null)
    foreach (HtmlNode node in nodes)
        Console.WriteLine(node.GetAttributeValue("href", null));
Fizzler
License: LGPL
An add-on to HtmlAgilityPack that allows the use of CSS selectors.
using Fizzler.Systems.HtmlAgilityPack; // the QuerySelectorAll extension lives here

HtmlDocument hap = new HtmlDocument();
hap.LoadHtml(html);
foreach (HtmlNode node in hap.DocumentNode.QuerySelectorAll("h3.r a"))
    Console.WriteLine(node.GetAttributeValue("href", null));
Since this is still HtmlAgilityPack underneath, it comes with all of that product's bugs.
CsQuery
License: MIT
The project is currently abandoned in favor of AngleSharp.
One of the modern HTML parsers for .NET. It is based on the validator.nu parser for Java, which in turn is a port of the parser from the Gecko engine (Firefox). This guarantees that the parser handles code exactly like modern browsers do.
The API takes inspiration from jQuery and uses the CSS selector language to select elements. The method names are copied almost one-to-one, which means that for programmers familiar with jQuery, learning will be simple.
Performance is high: on complex queries it is orders of magnitude faster than HtmlAgilityPack + Fizzler.
CQ cq = CQ.Create(html);
foreach (IDomObject obj in cq.Find("a"))
    Console.WriteLine(obj.GetAttribute("href"));
A more complex query barely complicates the code:
CQ cq = CQ.Create(html);
foreach (IDomObject obj in cq.Find("h3.r a"))
    Console.WriteLine(obj.GetAttribute("href"));
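The claim is easy to check against your own pages. A rough micro-benchmark sketch (assuming the CsQuery, HtmlAgilityPack and Fizzler packages are installed; absolute numbers will vary with the page and the query):

using System;
using System.Diagnostics;
using System.Linq;
using System.Net;
using CsQuery;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;

string html = new WebClient().DownloadString("http://google.com/search?q=test");

// Parse once with each library; time only the repeated queries.
CQ cq = CQ.Create(html);
HtmlDocument hap = new HtmlDocument();
hap.LoadHtml(html);

Console.WriteLine("CsQuery:       {0} ms", MeasureMs(() => cq.Find("h3.r a").ToList()));
Console.WriteLine("HAP + Fizzler: {0} ms", MeasureMs(() => hap.DocumentNode.QuerySelectorAll("h3.r a").ToList()));

static long MeasureMs(Action action, int iterations = 1000)
{
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
        action();
    return sw.ElapsedMilliseconds;
}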
However, if you are not familiar with jQuery's concepts, non-trivial use of the library may feel strange and unusual.
Regex
Scary and terrible regular expressions. It is undesirable to use them, but sometimes the need arises, since parsers that build a DOM are noticeably hungrier than Regex: they consume more processor time and more memory.
If it does come down to regular expressions, you need to understand that you cannot build a universal and absolutely reliable solution on top of them. However, if you only want to parse one specific site, this problem may not be as critical.
For heaven's sake, don't turn regexes into an unreadable mess. You don't write C# code on a single line with one-letter variable names; there is no need to mangle regular expressions either. The .NET regular expression engine is powerful enough to let you write quality code.
For example, here is a slightly modified version of the link-extracting code from the question:
// (?inx): i = ignore case, n = only named groups capture,
// x = whitespace and # comments are allowed inside the pattern
Regex reHref = new Regex(@"(?inx)
    <a \s [^>]*
        href \s* = \s*
            (?<q> ['""] )          # opening quote, single or double
                (?<url> [^'""]+ )  # the URL itself
            \k<q>                  # the matching closing quote
    [^>]* >");

foreach (Match match in reHref.Matches(html))
    Console.WriteLine(match.Groups["url"].Value);
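And if, as in the question, you also want to filter by CSS class with regex alone, the pattern balloons immediately. A rough sketch that keeps only links whose own class attribute contains the class r (approximate: it can be fooled by attributes such as data-class, and it cannot express the question's actual condition, where r sits on the parent h3, at all):

Regex reClassHref = new Regex(@"(?inx)
    <a \s
    (?= [^>]* class \s* = \s*      # lookahead: the tag must carry class 'r'
        (?<cq> ['""] )
        (?: [^'""]* \s )? r (?: \s [^'""]* )?
        \k<cq>
    )
    [^>]* href \s* = \s*
    (?<q> ['""] )
    (?<url> [^'""]+ )
    \k<q>
    [^>]* >");

foreach (Match match in reClassHref.Matches(html))
    Console.WriteLine(match.Groups["url"].Value);

One extra condition has doubled the pattern, which is exactly how such regexes turn monstrous.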