The Landmine of Parsing HTML and Stripping HTML Comments

A while ago I wrote a blog post about how painful it is to properly parse an email address. This post is kind of like that, except that this time, I take on HTML.

I’ve written about parsing HTML with a regular expression in the past and pointed out that it’s extremely tricky and probably not a good idea to use regular expressions in this case. In this post, I want to strip out HTML comments. Why?

I had some code that uses a regular expression to strip comments from HTML, but found one of those feared “pathological” cases in which it seems to never complete and pegs my CPU at 100% in the meantime. I figured I might as well look into a character-by-character approach to stripping HTML comments.

It sounds easy at first, and my first attempt was roughly 34 lines of procedural style code. But then I started digging into the edge cases. Take a look at this:

<p title="<!-- this is a comment-->">Test 1</p>

Should I strip that comment within the attribute value or not? Technically, this isn’t valid HTML since the first angle bracket within the attribute value should be encoded. However, the three browsers I checked (IE 8, FF3, Google Chrome) all honor this markup and render the following.

[Image: funky comment]

Notice that when I put the mouse over “Test 1”, the browser rendered the value of the title attribute as a tooltip. That’s not even the funkiest case. Check out this bit, in which my comment is an unquoted attribute value. Ugly!

<p title=<!this-comment>Test 2</p>

Still, the browsers dutifully render it:

[Image: funkier-comment]

At this point, it might seem like I’m spending too much time worrying about crazy edge cases, which is probably true. Should I simply strip these comments even if they happen to be within attribute values, because they’re technically invalid? It worries me a bit, however, to impose a different behavior than the browser does.
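
To make the dilemma concrete, here is a minimal Python sketch of the naive regular expression approach. It is not the code from this post; the pattern and the function name `strip_comments_naively` are assumptions made for illustration. It treats anything shaped like a comment as a comment, so it also rewrites the attribute value that browsers honor.

```python
import re

# Assumed pattern: anything from "<!--" to the nearest "-->", wherever it appears.
NAIVE_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments_naively(html: str) -> str:
    """Remove every comment-shaped span, even one inside an attribute value."""
    return NAIVE_COMMENT.sub("", html)


sample = '<p title="<!-- this is a comment-->">Test 1</p>'
print(strip_comments_naively(sample))
# prints: <p title="">Test 1</p>
```

The tooltip text the browser would have shown is gone, which is exactly the difference in behavior that worries me.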

Just thinking out loud here, but what if the user can specify a style attribute (bad idea) for an element and they enter:

<!>color: expression(alert('test'))

Which fully rendered yields: <p style="<!>color: expression(alert('test'))">

If we strip out the comment, then suddenly, the style attribute might lend itself to an attribute based XSS attack.
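
A small Python sketch of how that could play out, assuming a hypothetical stripper that removes every `<! ... >` declaration regardless of where it sits. The `DECLARATION` pattern is an illustration only, not the code from this post.

```python
import re

# Hypothetical stripper: removes any "<!" ... ">" span, including the empty "<!>".
DECLARATION = re.compile(r"<![^>]*>")

user_style = "<!>color: expression(alert('test'))"
rendered = '<p style="{}">'.format(user_style)

stripped = DECLARATION.sub("", rendered)
print(stripped)
# prints: <p style="color: expression(alert('test'))">
```

Before stripping, the attribute value starts with `<!>`; after stripping, it starts directly with the `color:` declaration, so the stripper itself has changed what the style attribute says.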

I tried this on the three browsers I mentioned and nothing bad happened, so maybe it’s a non-issue. But I figured it would probably make sense to strip HTML comments only in the cases where the browser does. So I decided not to strip any comments within an HTML tag, which means I have to identify HTML tags. That starts to get a bit ugly, as <foo > is assumed to be an HTML tag and is not displayed, while <çoo /> is just content and is displayed.
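
The rule described above can be sketched in Python roughly as follows. This is a simplified illustration, not the code I posted. It assumes that a tag is a `<` followed by an ASCII letter (or by `/` and an ASCII letter), that quoted attribute values may contain `>`, and that only the `<!-- ... -->` form is stripped. The helper names are made up for the example.

```python
def _starts_tag(html: str, i: int) -> bool:
    """True if the '<' at index i is followed by an ASCII letter (or '/' + letter)."""
    j = i + 1
    if j < len(html) and html[j] == "/":
        j += 1
    return j < len(html) and html[j].isascii() and html[j].isalpha()


def _tag_end(html: str, i: int) -> int:
    """Index just past the '>' closing the tag opened at i, skipping quoted values."""
    quote = None
    for j in range(i + 1, len(html)):
        ch = html[j]
        if quote:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            return j + 1
    return len(html)  # assumption: an unterminated tag swallows the rest


def strip_comments_outside_tags(html: str) -> str:
    """Strip <!-- ... --> comments, but copy anything inside a tag verbatim."""
    out = []
    i, n = 0, len(html)
    while i < n:
        if html.startswith("<!--", i):
            end = html.find("-->", i + 4)
            if end == -1:
                # Assumption: an unterminated comment is left as-is.
                out.append(html[i:])
                break
            i = end + 3
        elif html[i] == "<" and _starts_tag(html, i):
            j = _tag_end(html, i)
            out.append(html[i:j])
            i = j
        else:
            out.append(html[i])
            i += 1
    return "".join(out)


print(strip_comments_outside_tags('<p title="<!-- this is a comment-->">Test 1</p>'))
# prints: <p title="<!-- this is a comment-->">Test 1</p>
print(strip_comments_outside_tags("<foo >kept<!-- gone -->"))
# prints: <foo >kept
```

Because `<çoo />` does not start with an ASCII letter, this sketch treats it as plain content and copies it through character by character.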

Before I show the code, I should clarify something. I’ve been a bit imprecise here. Technically, a comment starts with the -- characters, but I’ve referred to markup such as <!> as being a comment. Technically it’s not, but it behaves like one in the sense that the browser DOM recognizes it as such. With HTML you can have multiple comments between the <! and the > delimiters, according to section 3.2.5 of RFC 1866.

3.2.5. Comments

   To include comments in an HTML document, use a comment declaration. A
   comment declaration consists of `<!' followed by zero or more
   comments followed by `>'. Each comment starts with `--' and includes
   all text up to and including the next occurrence of `--'. In a
   comment declaration, white space is allowed after each comment, but
   not before the first comment.  The entire comment declaration is
   ignored.

      NOTE - Some historical HTML implementations incorrectly consider
      any `>' character to be the termination of a comment.

   For example:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HEAD>
    <TITLE>HTML Comment Example</TITLE>
    <!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp  -->
    <!-- another -- -- comment -->
    <!>
    </HEAD>
    <BODY>
    <p> <!- not a comment, just regular old data characters ->
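
The grammar quoted above is small enough to sketch directly. The following Python function is a minimal illustration of the RFC 1866 wording only. It makes no claim about how real browsers behave, and the name `match_comment_declaration` is an assumption made for the example.

```python
def match_comment_declaration(text: str, start: int = 0) -> int:
    """Return the index just past a comment declaration at `start`, or -1.

    Follows the RFC 1866 wording: "<!" followed by zero or more comments,
    each delimited by "--" ... "--", with white space allowed after each
    comment (but not before the first), followed by ">".
    """
    if not text.startswith("<!", start):
        return -1
    i = start + 2
    while text.startswith("--", i):
        close = text.find("--", i + 2)
        if close == -1:
            return -1  # the comment is never closed
        i = close + 2
        while i < len(text) and text[i].isspace():
            i += 1
    if i < len(text) and text[i] == ">":
        return i + 1
    return -1


print(match_comment_declaration("<!-- another -- -- comment -->"))
# prints: 30
print(match_comment_declaration("<!>"))
# prints: 3
print(match_comment_declaration("<!- not a comment, just regular old data characters ->"))
# prints: -1
```

Note that the DOCTYPE line in the RFC example also starts with `<!`, but it is not followed by `--` or `>`, so this sketch rejects it as a comment declaration.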
    

The code I wrote today was straight up old school procedural code with no attempt to make it modular, maintainable, object oriented, etc… I posted it to refactormycode.com here with the unit tests I defined.

In the end, I might not use this code as I realized later that what I really should be doing in the particular scenario I have is simply stripping all HTML tags and comments. In any case, I hope to never have to parse HTML again. ;)


What others have said

Peli Nov 10, 2008 6:52 PM
# re: The Landmine of Parsing HTML and Stripping HTML Comments
What about this input: '' ? The comment does not get stripped out :)
Peli Nov 10, 2008 8:17 PM
Interesting, my test case got stripped out by Subtext!
Peli Nov 10, 2008 8:28 PM
I was wondering why Pex could not get 100% coverage (1 block not covered), but it turns out you have some dead code in there:

    else if (inHtmlComment)
    {
        if (current == '>')
        {
            if (inHtmlComment) // <--- this is always true here :)
            {
                inHtmlComment = false;
                continue;
            }
        }
    }
Andrei Rinea Nov 11, 2008 1:50 AM
I have written a character-by-character HTML tag stripper at CodeProject.com. Have a look here.
toby mills Nov 11, 2008 3:36 AM
You should have a look at Jeff Atwood's experiences with HTML sanitization for
Stack Overflow. He's even made his code available on RefactorMyCode.com.

T
Thanigainathan S Nov 11, 2008 6:07 AM
Hi,

When will these kinds of problems come up in practical scenarios? Will this be useful in that case?

Thanks & regards,
Thanigainathan.S
Ismail Mayat Nov 11, 2008 7:42 AM
The HTML Agility Pack is brilliant for parsing HTML; see here.
Filini Nov 11, 2008 9:25 AM
My previous comment was rejected as SPAM, so I guess I cannot paste HTML code in here...

Anyway, when you write "Technically it is not a valid HTML", are you assuming that you are parsing XHTML? The example you posted is perfectly valid HTML 4.01 Transitional; I just checked it with the W3C Validator. If you want, I can mail you the full HTML page source (which I cannot post here in the comments).

I may be stating the obvious here, but if you are only parsing XHTML you can just load it in an XmlDocument and remove the comments :-)
Clive Chinery Nov 11, 2008 10:26 AM
I have written an HTML parser as part of the CommonData project at http://www.CodePlex.Com/CommonData. I will be updating my HTML comment handler.
haacked Nov 11, 2008 1:42 PM
@Filini Yes, I am assuming XHTML. But you're right about HTML 4.01 Transitional. I tried it out and it indeed validates.

Thanks for the note. For those that are interested, here's the HTML I tried in the W3C validator:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title></title></head>
<body>
<p title="<p />"></p>
</body>
</html>




jmbr Nov 12, 2008 12:49 AM
I find that HTML Tidy tends to come in handy for these tasks.
Filini Nov 12, 2008 12:51 AM
@Haacked, in this case, your problem is not removing the comments (as I said in my earlier comment, XmlDocument is an easy solution).
Your problem is getting valid XHTML in the first place, and that is a road to hell... I mean, an interesting challenge :-)
Chicago mover Aug 17, 2010 3:11 AM
Great work by the author, and the comment by Andrei Rinea was very much needed.
