HTML Stripping Challenge


UPDATE: I added three new unit tests and one interesting case that the three browsers each render differently.

Well I’m back at it, but this time I want to strip all HTML from a string. Specifically:

  • Remove all HTML opening and self-closing tags: Thus <foo> and <foo /> should be stripped.
  • Remove all HTML closing tags such as </p>.
  • Remove all HTML comments.
  • Do not strip any text in between tags that would be rendered by the browser.

This may not sound all that difficult, but I have a feeling that many existing implementations out there would not pass the set of unit tests I wrote to verify this behavior.

I’ll present some pathological cases to demonstrate some of the odd edge cases.

We’ll start with a relatively easy one.

<foo title=">" />

Notice that this tag contains the closing angle bracket within an attribute value. That closing bracket should not close the tag. Only the one outside the attribute value should close it. Thus, the method should return an empty string in this case.
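
To make the rule concrete, here is a minimal sketch of a scanner that looks for the real end of such a tag. It is not the solution to the challenge: TagScanner and FindTagEnd are hypothetical names, and the sketch assumes a quoted value only starts when a quote character directly follows an = sign (ignoring whitespace). It also ignores the other edge cases discussed below.

```csharp
using System;

public static class TagScanner {
    // Hypothetical helper: given html[start] == '<', returns the index of the
    // '>' that closes the tag, or -1 if no closing '>' is found.
    public static int FindTagEnd(string html, int start) {
        bool afterEquals = false;
        for (int i = start + 1; i < html.Length; i++) {
            char c = html[i];
            if (afterEquals && (c == '"' || c == '\'')) {
                // Quoted attribute value: jump to the matching quote so that
                // any '>' inside the value is skipped.
                int close = html.IndexOf(c, i + 1);
                if (close < 0) return -1; // assumption: unclosed quote means no tag end
                i = close;
                afterEquals = false;
            } else if (c == '>') {
                return i;
            } else if (c == '=') {
                afterEquals = true;
            } else if (!char.IsWhiteSpace(c)) {
                // Any other character means an unquoted value (or name) has begun.
                afterEquals = false;
            }
        }
        return -1;
    }
}
```

For the markup above, FindTagEnd returns 16, the index of the final closing bracket, rather than 12, the index of the bracket inside the attribute value.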

Here’s another case:

<foo =test>title />

This is a non-quoted attribute value, so in this case the inner angle bracket does close the tag, leaving you with “title />”.

Here’s a case that surprised me.

<foo<>Test

That one strips everything except “<>Test”.

It gets even better…

<foo<!>Test

strips out everything except “Test”.
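
The last two cases suggest that an opening angle bracket can end a tag just as a closing one does, except that the opening bracket is not consumed. A rough sketch of that boundary rule follows. TagBoundary and ResumeIndex are hypothetical names, quoted attribute values are ignored for brevity, and treating an unterminated tag as swallowing the rest of the string is an assumption rather than verified browser behavior.

```csharp
using System;

public static class TagBoundary {
    // Hypothetical helper: given html[start] == '<', returns the index at
    // which scanning should resume after the tag.
    public static int ResumeIndex(string html, int start) {
        for (int i = start + 1; i < html.Length; i++) {
            if (html[i] == '>') return i + 1; // '>' is part of the tag
            if (html[i] == '<') return i;     // '<' ends the tag but is kept
        }
        return html.Length; // assumption: an unterminated tag swallows the rest
    }
}
```

For the first of the two cases, ResumeIndex returns 4, so scanning resumes at “<>Test”. In the second case the remainder begins with <!>, which is then removed as a comment, leaving “Test”.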

And finally, here’s one that’s a real doozy.

<foo<!--foo>Test

Check out how Firefox, IE, and Google Chrome render this same piece of markup.

[Screenshot: the markup rendered in Google Chrome]

[Screenshot: the markup rendered in Firefox]

[Screenshot: the markup rendered in IE]

One of these kids is doing his own thing! ;) For my unit test, I decided to go with majority rules here (I did not test with Opera) and went with the behavior of the other two browsers rather than Firefox’s.

The Challenge

The following is the shell of a method for stripping HTML from a string based on the requirements listed above.

public static class Html {
  public static string StripHtml(string html) {
    throw new NotImplementedException("implement this");
  }
}

Your challenge, should you choose to accept it, is to implement this method such that the following unit tests pass. I apologize for the small font, but I used some long names and wanted it to fit.

[TestMethod]
public void NullHtml_ThrowsArgumentException() {
    try {
        Html.StripHtml(null);
        Assert.Fail();
    }
    catch (ArgumentNullException) {
    }
}

[TestMethod]
public void Html_WithEmptyString_ReturnsEmpty() {
    Assert.AreEqual(string.Empty, Html.StripHtml(string.Empty));
}

[TestMethod]
public void Html_WithNoTags_ReturnsTextOnly() {
    string html = "This has no tags!";
    Assert.AreEqual(html, Html.StripHtml(html));
}

[TestMethod]
public void Html_WithOnlyATag_ReturnsEmptyString() {
    string html = "<foo>";
    Assert.AreEqual(string.Empty, Html.StripHtml(html));
}

[TestMethod]
public void Html_WithOnlyConsecutiveTags_ReturnsEmptyString() {
    string html = "<foo><bar><baz />";
    Assert.AreEqual(string.Empty, Html.StripHtml(html));
}

[TestMethod]
public void Html_WithTextBeforeTag_ReturnsText() {
    string html = "Hello<foo>";
    Assert.AreEqual("Hello", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithTextAfterTag_ReturnsText() {
    string html = "<foo>World";
    Assert.AreEqual("World", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithTextBetweenTags_ReturnsText() {
    string html = "<p><foo>World</foo></p>";
    Assert.AreEqual("World", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithClosingTagInAttrValue_StripsEntireTag() {
    string html = "<foo title=\"/>\" />";
    Assert.AreEqual(string.Empty, Html.StripHtml(html));
}

[TestMethod]
public void Html_WithTagClosedByStartTag_StripsFirstTag() {
    string html = "<foo <>Test";
    Assert.AreEqual("<>Test", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithSingleQuotedAttrContainingDoubleQuotesAndEndTagChar_StripsEntireTag() { 
    string html = @"<foo ='test""/>title' />";
    Assert.AreEqual(string.Empty, Html.StripHtml(html));
}

[TestMethod]
public void Html_WithDoubleQuotedAttributeContainingSingleQuotesAndEndTagChar_StripsEntireTag() {
    string html = @"<foo =""test'/>title"" />";
    Assert.AreEqual(string.Empty, Html.StripHtml(html));
}

[TestMethod]
public void Html_WithNonQuotedAttribute_StripsEntireTagWithoutStrippingText() {
    string html = @"<foo title=test>title />";
    Assert.AreEqual("title />", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithNonQuotedAttributeContainingDoubleQuotes_StripsEntireTagWithoutStrippingText() {
    string html = @"<p title = test-test""-test>title />Test</p>";
    Assert.AreEqual("title />Test", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithNonQuotedAttributeContainingQuotedSection_StripsEntireTagWithoutStrippingText() {
    string html = @"<p title = test-test""- >""test> ""title />Test</p>";
    Assert.AreEqual(@"""test> ""title />Test", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithTagClosingCharInAttributeValueWithNoNameFollowedByText_ReturnsText() {
    string html = @"<foo = "" />title"" />Test";
    Assert.AreEqual("Test", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithTextThatLooksLikeTag_ReturnsText() {
    string html = @"<çoo = "" />title"" />Test";
    Assert.AreEqual(html, Html.StripHtml(html));
}

[TestMethod]
public void Html_WithCommentOnly_ReturnsEmptyString() {
    string s = "<!-- this go bye bye>";
    Assert.AreEqual(string.Empty, Html.StripHtml(s));
}

[TestMethod]
public void Html_WithNonDashDashComment_ReturnsEmptyString() {
    string s = "<! this go bye bye>";
    Assert.AreEqual(string.Empty, Html.StripHtml(s));
}

[TestMethod]
public void Html_WithTwoConsecutiveComments_ReturnsEmptyString() {
    string s = "<!-- this go bye bye><!-- another comment>";
    Assert.AreEqual(string.Empty, Html.StripHtml(s));
}

[TestMethod]
public void Html_WithTextBeforeComment_ReturnsText() {
    string s = "Hello<!-- this go bye bye -->";
    Assert.AreEqual("Hello", Html.StripHtml(s));
}

[TestMethod]
public void Html_WithTextAfterComment_ReturnsText() {
    string s = "<!-- this go bye bye -->World";
    Assert.AreEqual("World", Html.StripHtml(s));
}

[TestMethod]
public void Html_WithAngleBracketsButNotHtml_ReturnsText() {
    string s = "<$)*(@&$(@*>";
    Assert.AreEqual(s, Html.StripHtml(s));
}

[TestMethod]
public void Html_WithCommentInterleavedWithText_ReturnsText() {
    string s = "Hello <!-- this go bye bye --> World <!--> This is fun";
    Assert.AreEqual("Hello  World  This is fun", Html.StripHtml(s));
}

[TestMethod]
public void Html_WithCommentBetweenNonTagButLooksLikeTag_DoesStripComment() {
    string s = @"<ç123 title=""<!bc def>"">";
    Assert.AreEqual(@"<ç123 title="""">", Html.StripHtml(s));
}


[TestMethod]
public void Html_WithTagClosedByStartComment_StripsFirstTag() {
    // Note: in Firefox, this renders: <!--foo>Test
    string html = "<foo<!--foo>Test";
    Assert.AreEqual("Test", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithTagClosedByProperComment_StripsFirstTag() {
    string html = "<FOO<!-- FOO -->Test";
    Assert.AreEqual("Test", Html.StripHtml(html));
}

[TestMethod]
public void Html_WithTagClosedByEmptyComment_StripsFirstTag() {
    string html = "<foo<!>Test";
    Assert.AreEqual("Test", Html.StripHtml(html));
}

What’s the moral of this story, apart from “Phil has way too much time on his hands?” In part, it’s that parsing HTML is fraught with peril. I wouldn’t be surprised if there are some cases here that I’m missing. If so, let me know. I used Firefox’s DOM Explorer to help verify the behavior I was seeing.

I think this is also another example of the challenges of software development in general along with the 80-20 rule. It’s really easy to write code that handles 80% of the cases. Most of the time, that’s good enough. But when it comes to security code, even 99% is not good enough, as hackers will find that 1% and exploit it.
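
To illustrate the 80% end of that spectrum, here is a sketch of the most common naive approach: a single regex that drops everything between an opening angle bracket and the next closing one. NaiveStripper is a hypothetical name used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveStripper {
    // The "80%" approach: remove anything between '<' and the next '>'.
    public static string Strip(string html) {
        if (html == null) throw new ArgumentNullException("html");
        return Regex.Replace(html, "<[^>]+>", string.Empty);
    }
}
```

This handles <p><foo>World</foo></p> correctly and returns “World”. However, for <foo title=">" /> it stops at the bracket inside the attribute value and leaves `" />` behind. It also reduces <$)*(@&$(@*> to an empty string, whereas the tests above expect that input to come back unchanged.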

In any case, I think I’m really done with this topic for now. I hope it was worthwhile. And as I said, I’ll post my code solution to this later. Let me know if you find missing test cases.



Comments


30 responses

  1. Craig November 11th, 2008

    Here's how I used to do it in Delphi. Someone can convert to C# if they can be bothered. This solution is probably not that scalable as it creates an instance of the WebBrowser every time, but it is pretty reliable.

    function HtmlToText(const _html: string): string;
    var
      WebBrowser: TWebBrowser;
      Document: IHtmlDocument2;
      Doc: OleVariant;
      v: Variant;
      Body: IHTMLBodyElement;
      TextRange: IHTMLTxtRange;
    begin
      Result := '';
      WebBrowser := TWebBrowser.Create(nil);
      try
        Doc := 'about:blank';
        WebBrowser.Navigate2(Doc);
        Document := WebBrowser.Document as IHtmlDocument2;
        if (Assigned(Document)) then
        begin
          v := VarArrayCreate([0, 0], varVariant);
          v[0] := _html;
          Document.Write(PSafeArray(TVarData(v).VArray));
          Document.Close;
          Body := Document.body as IHTMLBodyElement;
          TextRange := Body.createTextRange;
          Result := TextRange.text;
        end;
      finally
        WebBrowser.Free;
      end;
    end;

  2. Martin Hyldahl November 11th, 2008

    This isn't that scalable either, but for HTML parsing needs you could use the HTML Agility Pack library by Simon Mourier.
    http://www.codeplex.com/htmlagilitypack
    From memory, an HTML stripping method could look something like this:
    public static class Html
    {
        public static string StripHtml(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);
            // Maybe an inner trim function is needed as well
            return System.Web.HttpUtility.HtmlDecode(doc.DocumentNode.InnerText.Trim());
        }
    }
    Btw, when stripping HTML tags, you might also want to decode HTML-encoded characters like ø, æ, å etc...

  3. J.D Pihl November 11th, 2008

    Ask and you shall receive..
    (As a disclaimer, I just threw some regex at it to pass your little test, I wouldn't trust it as far as I can throw it or vouch for it in any way.)

    public static string StripHtml(string html)
    {
        if (html == null) throw new ArgumentNullException("html");
        var re = new Regex(@"<[\/!A-z]+(?:.*?(?:=\s?(?:(""|')[^\1]*\1|[^\s>]*))?)+(?:>|(<))");
        return re.Replace(html, @"${2}");
    }

  4. haacked November 11th, 2008

    @J.D Pihl nicely done! That looks very similar to my HTML matching regex way back when. But now I realize that mine needs some improvements.
    Also, I thought of a few more tests I should add. :)

  5. J.D Pihl November 11th, 2008

    Haha.. thanks.. I have something similar to your matching regex in some html -> xhtml cleanup project somewhere.
    Really is a terrible experience working with html parsing. :)

  6. J.D Pihl November 11th, 2008


    <meta<!> http-equiv="refresh" content="0;url=http://mymalicioussite.com/" />

    <meta http-equiv="refresh" content="0;url=http://mymalicioussite.com/" />


    A quick couple suggestions that you might want to add.

  7. haacked November 11th, 2008

    Hi J.D. When posting URLs, Subtext converts them to links. That's interfering with what you're trying to post. I tried to fix up your first comment. Let me know if I got anything wrong. Thanks!

  8. haacked November 11th, 2008

    In any case, after stripping the HTML, it should be safe to Html Encode it, since there's theoretically no HTML left. That will ensure that no HTML sneaks through.

  9. haacked November 11th, 2008

    Ok, I added a couple of new test cases and added one very interesting case complete with screenshots.

  10. Speednet November 11th, 2008

    Fun challenge! I was not aware of all the screwy rules for HTML tag matching, so it's been a learning experience too. So thanks!
    Here's my Html class, written in VB. It passed all the tests. Here are the key points describing my code:
    1. Uses a Regex that is compiled once at program startup, so that it will execute very quickly each time it's run, with any Regex compile hit happening during program load.
    2. Uses atomic grouping in the HTML tag contents, which makes the Regex match happen very quickly and efficiently (no unnecessary backtracking).
    3. Two uses of Regex conditionals -- something that people may not be used to seeing, but can be very useful. The first use is for not grabbing a "<" character that ends an HTML tag, but grabbing a ">" if that's the character that ends the tag. The second use matches a whitespace char after a comment only if a whitespace char was not matched before the comment.
    4. In the Replace() call, both $1 and $2 will be empty for an HTML tag, but if it's a comment then one or the other (or neither) of $1 and $2 will be a single whitespace char.
    5. It was not in the rules of the challenge, but this could also be modified fairly easily to treat HTML tags surrounded by whitespace in the same manner that comments are. (Interwoven with text, with only one intermediate space remaining after a replacement.)
    I hope the code comes out OK in your blog comments. If it doesn't, please let me know the best way to do it. (And consider adding preview! ;-)
    -Todd ("Speednet")


    Public Class Html
    Private Shared ReadOnly _Regex_HTML As New Regex("[=]\s*""[^""]*""|=\s*'[^']*'|[^<>])+(?(?=<)|>)|(\s)*]*>(?(1)\s*|(\s)*)", RegexOptions.Compiled Or RegexOptions.IgnoreCase)
    Public Shared Function StripHtml(ByVal html As String) As String
    Return _Regex_HTML.Replace(html, "$1$2")
    End Function
    End Class

  11. Speednet November 11th, 2008

    Ugh. It messed up my Regex Pattern.
    I'll try again, with just the Regex definition:

    Private Shared ReadOnly _Regex_HTML As New Regex("</?(?=[a-z])(?>[=]\s*""[^""]*""|=\s*'[^']*'|[^<>])+(?(?=<)|>)|(\s)*<![^>]*>(?(1)\s*|(\s)*)", RegexOptions.Compiled Or RegexOptions.IgnoreCase)

  12. Colin November 11th, 2008

    Won't you need to be careful about the code page that is being read in and ultimately the page it will be written out as otherwise you could miss the utf-7 xss attacks?

  13. configurator November 11th, 2008

    I wonder why the post's RSS title became (at least when viewed in iGoogle):



    Editor: Snipped

  14. Steve Wagner November 11th, 2008

    You can also use the opensource http://wiki.developer.mindtouch.com/SgmlReader. It is an html reader which exposes the api of an xmlreader. So you can create an XmlDocument from any html.

  15. YaronD November 11th, 2008

    I just tried the one that rendered differently in FF for you, and my FF (3.0.3) rendered it just like IE and Google Chrome did for you.
    Opera, BTW, also agrees.

  16. haacked November 12th, 2008

    @configurator I noticed that iGoogle attempts to put a snippet of the blog post into the title attribute of the link so it shows up as a tooltip. Maybe they're running into similar problems I did when stripping HTML? ;)

  17. David S November 12th, 2008

    You guys have way too much time on your hands. =P
    Seriously though, this would make an awesome little library in codeplex for HTML Parsing.

  18. pcdinh November 12th, 2008

    In PHP you can simply use strip_tags(). .NET is too complicated and deficient.

  19. configurator November 12th, 2008

    Could be... Funny that the only time I see that bug is when speaking about how hard HTML stripping is :)

  20. haacked November 12th, 2008

    @pcdinh I wonder if that method passes all these tests. Care to verify?

  21. Greg November 12th, 2008

    Using sed with a decent regular expression would be much easier and much, much faster than .NET code.
    I've done this extensively to get, reformat and extract data from HTML web pages for insertion into a SQL Server database. It's much easier and much less error prone than writing your own .NET code.
    I used it to remove JavaScript and simplify the formatting of HTML (e.g., replacing all <table ...> tags that have lots of options set with <table>).
    sed faq: http://www.grymoire.com/Unix/Sed.html

  22. configurator November 12th, 2008

    @Greg, how is using sed less error prone than running the regex in .NET?
    Also, the title in iGoogle is now somehow fixed! :)

  23. Adam November 13th, 2008

    @haacked: I found a port of strip_tags for c# and it passed 12/17 of the first tests you had displayed before adding the new ones.
    I'm actually using the code in one of my sites and it seems to work fine, but then again I'm only dealing with stripping the code from a WYSIWYG editor.
    Here it is for anyone interested.

    public static string StripTags(this string str)
    {
        return str.StripTags("");
    }
    public static string StripTags(this string str, string allowed_tags)
    {
        string pattern_for_all_tags = "</?[^><]+>";
        // pattern for allowed tags
        string allowed_patterns = "";
        if (allowed_tags != "")
        {
            // get allowed tags if any exist
            Regex r = new Regex("[\\/<> ]+");
            allowed_tags = r.Replace(allowed_tags, "");
            string[] allowed_tags_array = allowed_tags.Split(',');
            foreach (string s in allowed_tags_array)
            {
                if (s == "") continue;
                // Defining patterns
                string p_1 = "<" + s + " [^><]*>$";
                string p_2 = "<" + s + ">";
                string p_3 = "</" + s + ">";
                if (allowed_patterns != "")
                    allowed_patterns += "|";
                allowed_patterns += p_1 + "|" + p_2 + "|" + p_3;
            }
        }
        // Get all html tags included in the string
        Regex strip_tags = new Regex(pattern_for_all_tags);
        MatchCollection all_tags_matched = strip_tags.Matches(str);
        if (allowed_patterns != "")
            foreach (Match m in all_tags_matched)
            {
                Regex r_1 = new Regex(allowed_patterns);
                Match m_1 = r_1.Match(m.Value);
                if (!m_1.Success)
                {
                    // if not allowed, replace it
                    str = str.Replace(m.Value, "");
                }
            }
        else
            // if no tags are allowed, replace them all
            str = strip_tags.Replace(str, "");
        return str;
    }

  24. Speednet November 15th, 2008

    @haacked: I was curious if you took a look at my solution. I had assumed when you said "challenge" that you would be interested in those posts where an actual programmed solution was presented. Pardon my utter lack of humbleness, but I think my solution was quite elegant in its conciseness and ability to flexibly handle each scenario, no?

  25. Speednet November 15th, 2008

    @haacked: I found another situation that requires another rule above, and I also have a minor quibble with one of your rules.
    First, the quibble. In your "WithCommentInterleavedWithText" test for removing comments, your test results show that if a comment is surrounded by a space on each side (as in that test), after the replacement only one of the spaces remains.
    While it may appear that way on the page (two consecutive spaces appear as one space in the rendered page), in the DOM the two separate text nodes remain, one with a trailing space and the other with a leading space. Thus, Html.StripHtml("Hello <!--> World") should return "Hello--World" (dashes = spaces), not "Hello-World".
    (As a caveat, Internet Explorer automatically normalizes the node list after removing any nodes, in this case resulting in the two text nodes merging into one, with a single separating space, but standards-based browsers like Firefox correctly keep them as two separate text nodes.)
    The additional exception I found is that script tags are handled in a special manner in browsers, so they must be dealt with differently than the regular tags and comments.
    Through testing, I have found that once a script tag begins, it will never stop consuming text until it finds a </script> tag. So if there is an HTML comment embedded within the script tag, as happens with most script tags, then the comment inside the script tag will be incorrectly stripped by any HTML stripper that does not treat script tags as a special case. (All the text after the first right-angle character (">") will be left in the string.)
    Here is a test I concocted for script tags (I'm also displaying my VB bias):

    <TestMethod()> _
    Public Sub Html_ScriptWithEmbeddedRightAngle_ReturnsEmptyString()
        Dim s As String = "<script>//<![CDATA[" & vbCrLf & "alert('>');//]]></script>"
        Assert.AreEqual(String.Empty, Html.StripHtml(s))
    End Sub

    So, as a result of the two items above, I have updated my original code so that it (a) does not attempt to consolidate spaces surrounding stripped comments, and (b) handles the special case of script tags.
    (Incidentally, I found out one other interesting tidbit: It seems that the only HTML tag that cannot be closed by starting a new tag before the last right-angle character is the ending </script> tag. i.e., <a href="#">Hello</a<br /> does display a link, but </script<br /> will not end a script.)

    Public Class Html
        Private Shared ReadOnly _Regex_HTML As New Regex("<script(?=[\s<>])(?>[^<]|<(?!/script[\s>]))*</script(?>=\s*""[^""]*""|=\s*'[^']*'|[^<>])*(?(?=<)|>)|</?(?=[a-z])(?>[=]\s*""[^""]*""|=\s*'[^']*'|[^<>])+(?(?=<)|>)|<![^>]*>", RegexOptions.Compiled Or RegexOptions.IgnoreCase)
        Public Shared Function StripHtml(ByVal html As String) As String
            Return _Regex_HTML.Replace(html, "")
        End Function
    End Class

    -Todd

  26. haacked November 16th, 2008

    @Speednet. If you view source, you'll see there are indeed two spaces. It's my HTML markup that's incorrect, not the actual test. ;)
    Sorry about that!

  27. Speednet November 16th, 2008

    Well, I guess I came to the same conclusion -- the hard way! ;-)

  28. balang November 16th, 2008

    @speednet your second regex works...thankz

  29. Steve C November 20th, 2008

    Anyone care to extend this regex to include support for excluding tags from being stripped?

    public static string StripHtml(string html, params string[] exclusions)

    ...would be the signature. This would return the string with all tags stripped except those defined in the exclusions array.
    Been trying to work on this one and it's giving me a bit of trouble.

  30. Doug November 26th, 2008

    Here is a regex (with the 80/20 rule in mind)
    /<\/?[^>]+>/gi