How to parse HTML/XML

(Or Any Arbitrarily Nested Data)

Summary

When faced with the task of parsing HTML (or XML and other similar grammars), many people immediately reach for the powerful text-processing capabilities of regular expressions. This is usually the wrong approach. HTML is a very 'loose' language to begin with, and over the years it has been increasingly abused by lazy programmers and novices who don't follow its specifications or grammar rules. This leaves us with a tremendous amount of non-conforming or outright broken HTML that is in regular use. Over the same period, parsers have evolved to cope with common problematic HTML, and will happily parse even the most horrible pages with at least some degree of fidelity to the document's original intent.

With that said, regular expressions have not evolved over the years to deal with the voluminous amount of horrid HTML out there (nor would they have any reason to). They are for matching specific patterns, and they work on things that have a known structure or format. They are inherently bad at telling apart cases that a human (or a token parser) distinguishes easily, such as (but not limited to) HTML nested in comments, overlapping tags, and HTML entities. They are also not good at focusing on a particular part of a document based on its relative structure. Most importantly, they are very bad at adapting to even small changes in the document itself.
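To make the comment problem concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. The sample markup, the `NAIVE_LINK` pattern, and the `LinkCollector` class are all made up for illustration; the pattern is meant to resemble a typical hand-rolled link regex, not any particular one.

```python
import re
from html.parser import HTMLParser

# Made-up sample: one live link, one commented-out link, and one link
# written with uppercase names, single quotes, and a line break in the tag.
SAMPLE = """
<p>See <a href="/real">the real page</a>.</p>
<!-- <a href="/old">retired link</a> -->
<p>Fish &amp; <A
   HREF='/chips'>chips</A></p>
"""

# A typical hand-rolled pattern: lowercase, double-quoted, single-line only.
NAIVE_LINK = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def regex_links(html):
    return NAIVE_LINK.findall(html)


def parser_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(regex_links(SAMPLE))   # ['/real', '/old']
print(parser_links(SAMPLE))  # ['/real', '/chips']
```

The regex reports the commented-out link and misses the oddly formatted one. The parser skips the comment and picks up the live link regardless of case, quoting, or line breaks. You could keep patching the pattern for each of these, but every patch covers only the one variation you happened to think of.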

So without further ado, here is how you parse HTML documents:

DON'T use a Regular Expression (Regex, Regexp, RE)

Regular Expressions often break when parsing nested data.
Writing regular expressions to parse HTML/XML will not save you time; it will waste your time.
Don't ask people to help you write a regex to parse HTML/XML -- if they are qualified to help you, they already know you should be using a parser anyway.

DO use an HTML/XML Parser (examples)

HTML/XML Parsers are (coincidentally) designed to parse HTML/XML.
The people who spent the time writing parsers would simply have done it with a regular expression if that were the right way to do it.
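As a rough illustration of the nested-data point, the sketch below pulls the text out of one `div` that contains another. It uses Python's standard `html.parser`; the markup, the `NAIVE_DIV` pattern, and the `DivText` class are hypothetical names invented for this example, and the depth counter is just one simple way to track nesting.

```python
import re
from html.parser import HTMLParser

# Made-up markup: the "main" div contains a nested div.
NESTED = (
    '<div id="main">intro '
    '<div class="note">a nested note</div>'
    ' outro</div>'
    '<div id="footer">footer text</div>'
)

# The non-greedy match stops at the first </div>, which closes the inner div.
NAIVE_DIV = re.compile(r'<div id="main">(.*?)</div>', re.S)


class DivText(HTMLParser):
    """Collect the text inside <div id=target_id>, nested divs included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1             # a div nested inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1              # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def regex_div_text(html):
    match = NAIVE_DIV.search(html)
    return match.group(1) if match else None


def parser_div_text(html, target_id):
    parser = DivText(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(regex_div_text(NESTED))           # intro <div class="note">a nested note
print(parser_div_text(NESTED, "main"))  # intro a nested note outro
```

The regex cuts the content off at the inner closing tag. Making it greedy does not help either, since it would then run on to the footer's closing tag. The parser only needs to count how deep it is.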

When you can make some very strict guarantees about your data, it MIGHT be okay to parse it with a regular expression.

If...

This is a one-time script
AND the data has a known regular structure
AND the tags do not span lines
AND there are no multiply nested tags
AND the parts you need from the data are simple in nature

**If you cannot guarantee ALL of the above, DON'T DON'T DON'T use a regular expression**
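For what it's worth, here is a sketch of the kind of throwaway script where those guarantees might actually hold. Everything in it is an assumption made up for the example: the `<item>` export format, the `ITEM_LINE` pattern, and the `parse_items` helper. The one design choice worth copying is that it refuses to continue the moment a line breaks the assumed format, rather than silently skipping it.

```python
import re

# Hypothetical one-off export: exactly one flat <item> element per line.
EXPORT = """\
<item id="1">apple</item>
<item id="2">banana</item>
<item id="3">cherry</item>
"""

# Matches a whole line only: numeric id, text content with no tags inside.
ITEM_LINE = re.compile(r'<item id="(\d+)">([^<]*)</item>')


def parse_items(text):
    """Return {id: text}, failing loudly if any line breaks the format."""
    items = {}
    for number, line in enumerate(text.splitlines(), start=1):
        match = ITEM_LINE.fullmatch(line.strip())
        if match is None:
            raise ValueError(
                f"line {number} breaks the assumed format: {line!r}"
            )
        items[int(match.group(1))] = match.group(2)
    return items


print(parse_items(EXPORT))  # {1: 'apple', 2: 'banana', 3: 'cherry'}
```

If the data ever grows a nested tag, a multi-line element, or a new attribute, this raises a `ValueError` instead of returning wrong results, and that is your cue to switch to a parser.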

Links

Further Discussion

Parsing HTML With Regexes	A perlmonks thread in which #perlhelp's very own woggle discusses the topic at hand.
Bring Me Your Regexs! I Will Create HTML To Break Them!	An article on how regexes break while parsing HTML.
Do Not... DO NOT! Parse HTML with Regex's	Further reiteration for the logic impaired.

Parsers

HTML::Parser HTML::TableExtract HTML::TokeParser HTML::LinkExtor	Various Perl HTML Parser modules.
XML::Parser XML::SAX XML::Simple	Various Perl XML Parser modules.
HTML Agility Pack	A .NET Parser that is tolerant of malformed (real-world) HTML
Python HTMLParser class, Python htmllib parsing module, Beautiful Soup, and a Ruby port called Rubyful Soup (Thanks Ezio!)	HTML parsers for Python (Thanks Kenneth!)
Java HTMLParser Library	A parser for 'real world' HTML in Java.
The Regex Programming Wiki	Mark from The Regex Programming Wiki sent me a link to his site which has some great regex info as well as links to several HTML parsers in the FAQ section! Check it out!

Please note, I'm very interested in hearing of parser implementations that I'm missing or in languages not covered here. If you know of any, please send me a note to the address at the bottom of this page. If you find this page useful, I'd also appreciate hearing from you!

If you would like a specific credit other than a 'thanks <your name>', please let me know that too!