The Landmine of Parsing HTML and Stripping HTML Comments

A while ago I wrote a blog post about how painful it is to properly parse an email address. This post is kind of like that, except that this time, I take on HTML.

I’ve written about parsing HTML with a regular expression in the past and pointed out that it’s extremely tricky and probably not a good idea to use regular expressions in this case. In this post, I want to strip out HTML comments. Why?

I had some code that uses a regular expression to strip comments from HTML, but found one of those feared “pathological” cases in which it seems to never complete and pegs my CPU at 100% in the meanwhile. I figure I might as well look into trying a character by character approach to stripping HTML.

It sounds easy at first, and my first attempt was roughly 34 lines of procedural style code. But then I started digging into the edge cases. Take a look at this:

<p title="<!-- this is a comment-->">Test 1</p>

Should I strip that comment within the attribute value or not? Technically, this isn’t valid HTML since the first angle bracket within the attribute value should be encoded. However, the three browsers I checked (IE 8, FF3, Google Chrome) all honor this markup and render the following.

funky
comment

Notice that when I put the mouse over “Test 1” and the browser rendered the value of the title attribute as a tooltip. That’s not even the funkiest case. Check this bit out in which my comment is an unquoted attribute value. Ugly!

<p title=<!this-comment>Test 2</p>

Still, the browsers dutifully render it:

funkier-comment

At this point, It might seem like I’m spending too much time worrying about crazy edge cases, which is probably true. Should I simply strip these comments even if they happen to be within attribute values because they’re technically invalid. However, it worries me a bit to impose a different behavior than the browser does.

Just thinking out loud here, but what if the user can specify a style attribute (bad idea) for an element and they enter:

<!>color: expression(alert('test'))

Which fully rendered yields: <p style="<!>color: expression(alert('test'))">

If we strip out the comment, then suddenly, the style attribute might lend itself to an attribute based XSS attack.

I tried this on the three browsers I mentioned and nothing bad happened, so maybe it’s a non issue. But I figured it would probably make sense to go ahead and strip the HTML comments in the cases that the browser. So I decided to not strip any comments within an HTML tag, which means I have to identify HTML tags. That starts to get a bit ugly as <foo > is assumed to be an HTML tag and not displayed while <çoo /> is just content and displayed.

Before I show the code, I should clarify something. I’ve been a bit imprecise here. Technically, a comment starts with a – character, but I’ve referred to markup such as <!> as being a comment. Technically it’s not, but it behaves like one in the sense that the browser DOM recognizes it as such. With HTML you can have multiple comments between the <! and the > delimiters according to section 3.2.5 of RFC 1866.

3.2.5. Comments

   To include comments in an HTML document, use a comment declaration. A
   comment declaration consists of `<!' followed by zero or more
   comments followed by `>'. Each comment starts with `--' and includes
   all text up to and including the next occurrence of `--'. In a
   comment declaration, white space is allowed after each comment, but
   not before the first comment.  The entire comment declaration is
   ignored.

      NOTE - Some historical HTML implementations incorrectly consider
      any `>' character to be the termination of a comment.

   For example:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HEAD>
    <TITLE>HTML Comment Example</TITLE>
    <!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp  -->
    <!-- another -- -- comment -->
    <!>
    </HEAD>
    <BODY>
    <p> <!- not a comment, just regular old data characters ->

The code I wrote today was straight up old school procedural code with no attempt to make it modular, maintainable, object oriented, etc… I posted it to refactormycode.com here with the unit tests I defined.

In the end, I might not use this code as I realized later that what I really should be doing in the particular scenario I have is simply stripping all HTML tags and comments. In any case, I hope to never have to parse HTML again. ;)

Comments

12 responses

Peli • November 10th, 2008
What about this input: '<a' ? The comment does not get stripped out :)
Peli • November 10th, 2008
Interresting my test case got stripped out by SubText!
Peli • November 10th, 2008
I was wondering why Pex could not get 100% coverage (1 block not covered) but it turns out you have some dead code in there :

else if (inHtmlComment)
{
if (current == '>')
{
if (inHtmlComment) <--------------- this is true :)
{
inHtmlComment = false;
continue;
}
}
Andrei Rînea • November 10th, 2008
I have written a character-by-character HTML tag stripping at CodeProject.com. Have a look here.
toby mills • November 10th, 2008
You should have a look at Jeff Atwoods experiences on HTML sanitization for
Stack Overflow. He's even provided his code available on RefactorMyCode.com.
T
Thanigainathan S • November 10th, 2008
Hi,
When will be these kind of problems will come in practical scenarios. Will that be useful in that case ?
Thanks & regards,
Thanigainathan.S
Ismail Mayat • November 10th, 2008
html agility kit is brilliant for parsing html see here
Filini • November 10th, 2008
My previous comment was rejected as SPAM, so I guess I cannot paste HTML code in here...
Anyway, when you write "Technically it is not a valid HTML" are you assuming that you are parsing XHTML? The example you posted is a perfectly valid HTML 4.01 Transitional, I just checked on the W3C Validator. If you want, I can mail you an HTML full page source code (that I cannot post here on the comments).
I may be obvious here, but if you are only parsing XHTML you can just load it in an XmlDocument and remove the comments :-)
Clive Chinery • November 10th, 2008
I have written an HTML parser as part of the CommonData project at http://www.CodePlex.Com/CommonData. I will be updating my HTML comment handler.
haacked • November 11th, 2008
@Filini Yes, I am assuming XHTML. But you're right about HTML 4.01 Transitional. I tried it out and it indeed validates.
Thanks for the note. For those that are interested, here's the HTML I tried in the W3C validator:
```
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
  "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title></title></head>
<body>
<p title="<p />"></p>
</body>
</html>
```
jmbr • November 11th, 2008
I find that HTML Tidy tends to come in handy for these tasks.
Filini • November 11th, 2008
@Haaked, in this case, your problem is not removing the comments (as I said in my earlier comment, XmlDocument is an easy solution).
Your problem is getting a valid XHTML in the first place, and that is a road to hell... I mean, an interesting challenge :-)