The Landmine of Parsing HTML and Stripping HTML Comments

code, regex 0 comments suggest edit

A while ago I wrote a blog post about how painful it is to properly parse an email address. This post is kind of like that, except that this time, I take on HTML.

I’ve written about parsing HTML with a regular expression in the past and pointed out that it’s extremely tricky and probably not a good idea to use regular expressions in this case. In this post, I want to strip out HTML comments. Why?

I had some code that uses a regular expression to strip comments from HTML, but found one of those feared “pathological” cases in which it seems to never complete and pegs my CPU at 100% in the meanwhile. I figure I might as well look into trying a character by character approach to stripping HTML.

It sounds easy at first, and my first attempt was roughly 34 lines of procedural style code. But then I started digging into the edge cases. Take a look at this:

<p title="<!-- this is a comment-->">Test 1</p>

Should I strip that comment within the attribute value or not? Technically, this isn’t valid HTML since the first angle bracket within the attribute value should be encoded. However, the three browsers I checked (IE 8, FF3, Google Chrome) all honor this markup and render the following.

funky
comment

Notice that when I put the mouse over “Test 1” and the browser rendered the value of the title attribute as a tooltip. That’s not even the funkiest case. Check this bit out in which my comment is an unquoted attribute value. Ugly!

<p title=<!this-comment>Test 2</p>

Still, the browsers dutifully render it:

funkier-comment 

At this point, It might seem like I’m spending too much time worrying about crazy edge cases, which is probably true. Should I simply strip these comments even if they happen to be within attribute values because they’re technically invalid. However, it worries me a bit to impose a different behavior than the browser does.

Just thinking out loud here, but what if the user can specify a style attribute (bad idea) for an element and they enter:

<!>color: expression(alert('test'))

Which fully rendered yields: <p style="<!>color: expression(alert('test'))">

If we strip out the comment, then suddenly, the style attribute might lend itself to an attribute based XSS attack.

I tried this on the three browsers I mentioned and nothing bad happened, so maybe it’s a non issue. But I figured it would probably make sense to go ahead and strip the HTML comments in the cases that the browser. So I decided to not strip any comments within an HTML tag, which means I have to identify HTML tags. That starts to get a bit ugly as <foo > is assumed to be an HTML tag and not displayed while <çoo /> is just content and displayed.

Before I show the code, I should clarify something. I’ve been a bit imprecise here. Technically, a comment starts with a – character, but I’ve referred to markup such as <!> as being a comment. Technically it’s not, but it behaves like one in the sense that the browser DOM recognizes it as such. With HTML you can have multiple comments between the <! and the > delimiters according to section 3.2.5 of RFC 1866.

3.2.5. Comments

   To include comments in an HTML document, use a comment declaration. A
   comment declaration consists of `<!' followed by zero or more
   comments followed by `>'. Each comment starts with `--' and includes
   all text up to and including the next occurrence of `--'. In a
   comment declaration, white space is allowed after each comment, but
   not before the first comment.  The entire comment declaration is
   ignored.

      NOTE - Some historical HTML implementations incorrectly consider
      any `>' character to be the termination of a comment.

   For example:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HEAD>
    <TITLE>HTML Comment Example</TITLE>
    <!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp  -->
    <!-- another -- -- comment -->
    <!>
    </HEAD>
    <BODY>
    <p> <!- not a comment, just regular old data characters ->

The code I wrote today was straight up old school procedural code with no attempt to make it modular, maintainable, object oriented, etc… I posted it to refactormycode.com here with the unit tests I defined.

In the end, I might not use this code as I realized later that what I really should be doing in the particular scenario I have is simply stripping all HTML tags and comments. In any case, I hope to never have to parse HTML again. ;)

Found a typo or error? Suggest an edit! If accepted, your contribution is listed automatically here.

Comments

avatar

12 responses

  1. Avatar for Peli
    Peli November 10th, 2008

    What about this input: '<a<!---->' ? The comment does not get stripped out :)

  2. Avatar for Peli
    Peli November 10th, 2008

    Interresting my test case got stripped out by SubText!

  3. Avatar for Peli
    Peli November 10th, 2008

    I was wondering why Pex could not get 100% coverage (1 block not covered) but it turns out you have some dead code in there :

    else if (inHtmlComment)
    {
    if (current == '>')
    {
    if (inHtmlComment) <--------------- this is true :)
    {
    inHtmlComment = false;
    continue;
    }
    }

  4. Avatar for Andrei Rînea
    Andrei Rînea November 10th, 2008

    I have written a character-by-character HTML tag stripping at CodeProject.com. Have a look here.

  5. Avatar for toby mills
    toby mills November 10th, 2008

    You should have a look at Jeff Atwoods experiences on HTML sanitization for
    Stack Overflow. He's even provided his code available on RefactorMyCode.com.
    T

  6. Avatar for Thanigainathan S
    Thanigainathan S November 10th, 2008

    Hi,
    When will be these kind of problems will come in practical scenarios. Will that be useful in that case ?
    Thanks & regards,
    Thanigainathan.S

  7. Avatar for Ismail Mayat
    Ismail Mayat November 10th, 2008

    html agility kit is brilliant for parsing html see here

  8. Avatar for Filini
    Filini November 10th, 2008

    My previous comment was rejected as SPAM, so I guess I cannot paste HTML code in here...
    Anyway, when you write "Technically it is not a valid HTML" are you assuming that you are parsing XHTML? The example you posted is a perfectly valid HTML 4.01 Transitional, I just checked on the W3C Validator. If you want, I can mail you an HTML full page source code (that I cannot post here on the comments).
    I may be obvious here, but if you are only parsing XHTML you can just load it in an XmlDocument and remove the comments :-)

  9. Avatar for Clive Chinery
    Clive Chinery November 10th, 2008

    I have written an HTML parser as part of the CommonData project at http://www.CodePlex.Com/CommonData. I will be updating my HTML comment handler.

  10. Avatar for haacked
    haacked November 11th, 2008

    @Filini Yes, I am assuming XHTML. But you're right about HTML 4.01 Transitional. I tried it out and it indeed validates.
    Thanks for the note. For those that are interested, here's the HTML I tried in the W3C validator:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
    "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head><title></title></head>
    <body>
    <p title="<p />"></p>
    </body>
    </html>



  11. Avatar for jmbr
    jmbr November 11th, 2008

    I find that HTML Tidy tends to come in handy for these tasks.

  12. Avatar for Filini
    Filini November 11th, 2008

    @Haaked, in this case, your problem is not removing the comments (as I said in my earlier comment, XmlDocument is an easy solution).
    Your problem is getting a valid XHTML in the first place, and that is a road to hell... I mean, an interesting challenge :-)