When faced with the task of parsing HTML (or XML and other similar grammars), many people immediately reach for the powerful text-processing capabilities of regular expressions. This is usually the wrong approach. HTML is a very 'loose' language to begin with, and over the years it has been increasingly abused by lazy programmers and novices who don't follow its specification or grammar rules. This leaves us with a tremendous amount of non-conforming or outright broken HTML that is in use on a regular basis. Over the same period, parsers have evolved to the point where they can cope with common problematic HTML and will happily parse even the most horrible pages for you, at least with some degree of fidelity to the document's original intent.
That said, regular expressions have not evolved (nor would they have any reason to) to deal with the voluminous amount of horrid HTML out there. They are for matching specific patterns, and they work well on things that have a known structure or format. They are inherently bad at making distinctions that a human (or a token parser) makes easily, such as (but not limited to) HTML nested in comments, overlapping tags, and HTML entities. They are also not good at focusing on a particular part of a document based on its relative structure. Most importantly, they are very bad at adapting to even small changes in the document itself.
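To make those failure modes concrete, here is a minimal sketch in Python. It compares a naive link-extracting regex with the standard library's `html.parser` module (the Python 3 home of the HTMLParser class). The sample markup, the pattern, and the names `links_by_regex`, `LinkCollector`, and `links_by_parser` are all made up for illustration; they are not taken from any of the resources listed on this page.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one ordinary link, one link inside a comment, and one
# link whose markup differs slightly (extra attribute, single quotes, entity).
SAMPLE = """
<p>See <a href="/docs">the docs</a>.</p>
<!-- <a href="/old">retired link</a> -->
<p>Or <a class="ext" href='/faq?a=1&amp;b=2'>the FAQ</a>.</p>
"""

# A typical naive pattern: assumes href is the first attribute and is
# wrapped in double quotes.
NAIVE_LINK = re.compile(r'<a href="([^"]*)"')


def links_by_regex(html):
    """Return every href the naive pattern matches, comments included."""
    return NAIVE_LINK.findall(html)


class LinkCollector(HTMLParser):
    """Collect href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # stripped the quotes and decoded entities in the values.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(html):
    """Return the href of every <a> tag the parser actually sees."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(links_by_regex(SAMPLE))   # ['/docs', '/old']
print(links_by_parser(SAMPLE))  # ['/docs', '/faq?a=1&b=2']
```

The regex reports the commented-out link and misses the third one entirely, because a single extra attribute and a change of quote style are enough to defeat it. The parser skips the comment, finds both live links, and decodes the `&amp;` entity along the way. The pattern could of course be patched for each of these cases, but every patch only covers the breakage you have already seen.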
So without further ado, here is how you parse HTML documents:
Parsing HTML With Regexes | A perlmonks thread in which #perlhelp's very own woggle discusses the topic at hand. |
Bring Me Your Regexs! I Will Create HTML To Break Them! | An article on how regexes break while parsing HTML. |
Do Not... DO NOT! Parse HTML with Regex's | Further reiteration for the logic impaired. |
HTML::Parser, HTML::TableExtract, HTML::TokeParser, HTML::LinkExtor | Various Perl HTML parser modules. |
XML::Parser, XML::SAX, XML::Simple | Various Perl XML parser modules. |
HTML Agility Pack | A .NET parser that is tolerant of malformed (real-world) HTML. |
Python HTMLParser class, Python htmllib parsing module, Beautiful Soup, and a Ruby port called Rubyful Soup (Thanks Ezio!) | HTML parsers for Python (Thanks Kenneth!) |
Java HTMLParser Library | A parser for 'real world' HTML in Java. |
The Regex Programming Wiki | Mark from The Regex Programming Wiki sent me a link to his site, which has some great regex info as well as links to several HTML parsers in the FAQ section! Check it out! |
<matt at icenine dot ca>