Why Not Convert HTML to XML?

Pat Gannon (no blog) makes a great point in the comments on my post about using regular expressions to parse HTML. He says:

Just to play devil’s advocate for a minute, it seems like HTML is just too darned close to XML to have to parse this way. Isn’t there a library out there for converting HTML into XHTML? If you can do that, you can just read the file in using XmlDocument::LoadXml(). Once you’ve done that, you can find your tags using an XPath query. Sorry, I just couldn’t let a parsing post go by without tossing in my two cents ;)

In fact, there are two approaches to this. The first recognizes that HTML is really just an application of SGML. Thus, if you have an SGML parser, you’re done. So one option is to try Chris Lovett’s SgmlReader.
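
To make Pat’s suggestion concrete, here is a minimal sketch of why a conversion step is needed before an XML parser will touch typical HTML. It uses Python’s standard xml.etree.ElementTree as a stand-in for XmlDocument, which is an assumption made purely for illustration; the sample markup and the helper names try_parse and feed_hrefs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Typical HTML: <link>, <p> and <br> are left unclosed (sample markup is made up).
html = ('<html><head><link rel="alternate" type="application/rss+xml" '
        'href="/rss.xml"></head><body><p>Hello<br>world</body></html>')

# The same content after a hypothetical HTML-to-XHTML conversion step.
xhtml = ('<html><head><link rel="alternate" type="application/rss+xml" '
         'href="/rss.xml"/></head><body><p>Hello<br/>world</p></body></html>')

def try_parse(markup):
    """Return the root element, or None if the markup is not well-formed XML."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        return None

def feed_hrefs(root):
    """Find <link rel="alternate"> elements with an XPath-style query."""
    return [link.get("href") for link in root.findall(".//link[@rel='alternate']")]

print(try_parse(html))               # the strict XML parser rejects the raw HTML
print(feed_hrefs(try_parse(xhtml)))  # the converted markup can be queried
```

The strict parser gives up on the raw HTML, while the well-formed version can be queried by path. That gap is what a converter or an SGML-aware reader has to bridge.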

SgmlReader is, in fact, what the current version of RSS Bandit uses for auto-discovery of RSS feeds within HTML content. However, I recently replaced it with regular expressions because of the memory use and performance problems we were having with it. In our case, finding these tags with a regular expression is a lot faster and uses less memory. (Now you see the motivation for the post.)
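
As a rough illustration of the regular expression approach to feed auto-discovery, here is a sketch in Python. It is not the code RSS Bandit uses; the patterns, the FEED_TYPES set, and the function name find_feed_links are all assumptions made for the example. It also assumes quoted attribute values that contain no ">" character.

```python
import re

# Matches a whole <link ...> tag, in any letter case.
LINK_TAG = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
# Matches name="value" or name='value' pairs inside a tag.
ATTR = re.compile(r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

# MIME types treated as feeds here (an assumption for this sketch).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

def find_feed_links(html):
    """Return the href of every <link rel="alternate"> tag that advertises a feed."""
    feeds = []
    for tag in LINK_TAG.findall(html):
        # Collect the tag's attributes regardless of the order they appear in.
        attrs = {name.lower(): dq or sq for name, dq, sq in ATTR.findall(tag)}
        rel = attrs.get("rel", "").lower().split()
        is_feed = attrs.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and "href" in attrs:
            feeds.append(attrs["href"])
    return feeds

page = ('<html><head><LINK rel="alternate" type="application/rss+xml" '
        'title="RSS" href="http://example.com/rss.xml">'
        '<link rel="stylesheet" href="s.css"></head><body><p>unclosed</body>')
print(find_feed_links(page))
```

Because the patterns only look at individual link tags, the unclosed tags elsewhere in the page do not matter, and no document tree is ever built in memory.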

Another option is to use Simon Mourier’s HTML Agility Pack. He takes an interesting approach in that he provides an HtmlDocument class that implements System.Xml.XPath.IXPathNavigable. Thus his approach provides the same interface as an XmlDocument for querying nodes, but doesn’t change the underlying HTML content, as many other approaches would by converting it to XML.
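
The general idea of querying HTML as it is, without first converting it to XML, can be sketched with a tolerant parser. The example below uses Python’s standard html.parser module. It is not the HTML Agility Pack and offers no XPath support; the class name FeedLinkFinder, the helper find_feeds, and the FEED_TYPES set are hypothetical names chosen for the sketch.

```python
from html.parser import HTMLParser

# MIME types treated as feeds here (an assumption for this sketch).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collects feed URLs from <link> tags while reading HTML as it is."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag != "link":
            return
        a = dict(attrs)
        # Attributes written without a value arrive as None, hence the "or".
        rel = (a.get("rel") or "").lower().split()
        type_ = (a.get("type") or "").lower()
        if "alternate" in rel and type_ in FEED_TYPES and a.get("href"):
            self.feeds.append(a["href"])

def find_feeds(html):
    parser = FeedLinkFinder()
    parser.feed(html)
    parser.close()
    return parser.feeds

page = ('<html><head><LINK REL="alternate" TYPE="application/atom+xml" '
        'HREF="/atom.xml"><link rel=stylesheet href=s.css></head>'
        '<body><p>one<br>two</body>')
print(find_feeds(page))
```

The unclosed tags and unquoted attributes are handled by the parser, and the original markup is never rewritten.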

And just to toot Pat’s horn a bit, I used to be his manager at Solien when he was just starting out in his career. Now he works at Univision and has inherited reams of code that parse through Fortran code as well as proprietary database files. He’s also written his own grammar engine and XML syntax for describing computer languages such as C#. So he knows a thing or two about parsing text. He’s become quite a top-notch developer. I’m just waiting for him to get off his arse and start a blog.

Comments

5 responses

Jon Galloway • October 26th, 2004
The HTML Agility Pack worked well for my group in a recent project - XPath turned out to be a lot simpler than RegEx's due to nesting issues (comment in previous post).

Niels • October 26th, 2004
I vote that you continue to use Regular Expressions. I love them and I only wish I had known about them earlier. I have a friend who was using them in PHP, and at first I thought it was another hack that script programmers would use. However, once I saw the extreme power of using regular expressions, I was hooked!

Phil, do you realize how much easier that damn Address Validation application would have been if we had used Regular Expressions?!!

Haacked • October 26th, 2004
Ha! No shit. That was one tricky piece of code. But address validation is a tough cookie to parse, even with regular expressions because of all the damn exceptions.

Pat • October 27th, 2004
...and I play the tambourine! <j/k> Thanks for the kudos, Phil. I am meaning to start a blog, but [insert lame excuse here].

Bruno D. • July 19th, 2007
Here's another take, which nobody seems to be taking. If regex is superior in speed and memory use, and XPath much simpler to query with, why don't people create a simple XPath -> Regex converter? Its use would be so damn simple: create an XPath string, call this little convert script (or class or function or whatever), which writes the Regex equivalent for you, and bam, you just have to pass it to your favorite Regex handling function. IMHO, it should be 10x simpler than converting a whole HTML page and hoping it'll pass through the parser.
I've been fiddling with trying to scrape pages in PHP for like 2 months now and have tried a whole bunch of different solutions. I tried the same in Java (I'm mostly a Java dev) and decided to give up on XPath because it tends to break too much on poorly designed web pages (which are, let's admit, a very common occurrence). So far I haven't been able to find such a converter, sadly.