HTML Stripping Challenge

Nov 11, 2008 code html

UPDATE: I added three new unit tests and one interesting case that three browsers each render differently.

Well I’m back at it, but this time I want to strip all HTML from a string. Specifically:

Remove all HTML opening and self-closing

Comments

30 responses

Craig • November 11th, 2008
Here's how I used to do it in Delphi. Someone can convert it to C# if they can be bothered. This solution is probably not that scalable, as it creates an instance of the WebBrowser every time, but it is pretty reliable.
function HtmlToText(const _html: string): string;
var
  WebBrowser: TWebBrowser;
  Document: IHtmlDocument2;
  Doc: OleVariant;
  v: Variant;
  Body: IHTMLBodyElement;
  TextRange: IHTMLTxtRange;
begin
  Result := '';
  WebBrowser := TWebBrowser.Create(nil);
  try
    Doc := 'about:blank';
    WebBrowser.Navigate2(Doc);
    Document := WebBrowser.Document as IHtmlDocument2;
    if (Assigned(Document)) then
    begin
      v := VarArrayCreate([0, 0], varVariant);
      v[0] := _html;
      Document.Write(PSafeArray(TVarData(v).VArray));
      Document.Close;
      Body := Document.body as IHTMLBodyElement;
      TextRange := Body.createTextRange;
      Result := TextRange.text;
    end;
  finally
    WebBrowser.Free;
  end;
end;
Martin Hyldahl • November 11th, 2008
This isn't that scalable either, but for HTML parsing needs you could use the HTML Agility Pack library by Simon Mourier.
http://www.codeplex.com/htmlagilitypack
From memory, an HTML stripping method could look something like this:
public static class Html
{
    public static string StripHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        // An inner trim function may be needed as well
        return System.Web.HttpUtility.HtmlDecode(doc.DocumentNode.InnerText.Trim());
    }
}
Btw, when stripping HTML tags, you might also want to decode HTML-encoded characters like ø, æ, å etc.
J.D Pihl • November 11th, 2008
Ask and you shall receive..
(As a disclaimer, I just threw some regex at it to pass your little test, I wouldn't trust it as far as I can throw it or vouch for it in any way.)
public static string StripHtml(string html)
{
    if (html == null) throw new ArgumentNullException("html");
    var re = new Regex(@"<[\/!A-z]+(?:.*?(?:=\s?(?:(""|')[^\1]*\1|[^\s>]*))?)+(?:>|(<))");
    return re.Replace(html, @"${2}");
}
haacked • November 11th, 2008
@J.D Pihl nicely done! That looks very similar to my HTML matching regex way back when. But now I realize that mine needs some improvements.
Also, I thought of a few more tests I should add. :)
J.D Pihl • November 11th, 2008
Haha.. thanks.. I have something similar to your matching regex in some html -> xhtml cleanup project somewhere.
Really is a terrible experience working with html parsing. :)
J.D Pihl • November 11th, 2008
<meta<!> http-equiv="refresh" content="0;url=http://mymalicioussite.com/" /> <meta http-equiv="refresh" content="0;url=http://mymalicioussite.com/" />

A quick couple suggestions that you might want to add.
haacked • November 11th, 2008
Hi J.D. When posting URLs, Subtext converts them to links. That's interfering with what you're trying to post. I tried to fix up your first comment. Let me know if I got anything wrong. Thanks!
haacked • November 11th, 2008
In any case, after stripping the HTML, it should be safe to Html Encode it, since there's theoretically no HTML left. That will ensure that no HTML sneaks through.
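As a rough sketch of that strip-then-encode order in C#: the class name, the `ToSafeText` method, and the naive `NaiveStrip` placeholder below are all made up for illustration, and the placeholder stands in for whichever real stripper is used. `WebUtility.HtmlEncode` is the framework call doing the encoding.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class StripThenEncodeSketch
{
    // Placeholder stripper (assumption): a naive pattern, not one of the
    // solutions from this thread. Swap in a real StripHtml implementation.
    private static string NaiveStrip(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    // Strip first, then encode whatever text is left, so that any stray
    // '<', '>' or '&' character cannot be interpreted as markup.
    public static string ToSafeText(string html)
    {
        if (html == null) throw new ArgumentNullException("html");
        return WebUtility.HtmlEncode(NaiveStrip(html));
    }
}
```

For example, `StripThenEncodeSketch.ToSafeText("<b>Tom & Jerry</b> 3 > 2")` returns `Tom &amp; Jerry 3 &gt; 2`: the tags are gone and the leftover `&` and `>` are encoded.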
haacked • November 11th, 2008
Ok, I added a couple of new test cases and added one very interesting case complete with screenshots.
Speednet • November 11th, 2008
Fun challenge! I was not aware of all the screwy rules for HTML tag matching, so it's been a learning experience too. So thanks!
Here's my Html class, written in VB. It passed all the tests. Here are the key points describing my code:
1. Uses a Regex that is compiled once at program startup, so that it will execute very quickly each time it's run, with any Regex compile hit happening during program load.
2. Uses atomic grouping in the HTML tag contents, which makes the Regex match happen very quickly and efficiently (no unnecessary backtracking).
3. Two uses of Regex conditionals -- something that people may not be used to seeing, but can be very useful. The first use is for not grabbing a "<" character that ends an HTML tag, but grabbing a ">" if that's the character that ends the tag. The second use matches a whitespace char after a comment only if a whitespace char was not matched before the comment.
4. In the Replace() call, both $1 and $2 will be empty for an HTML tag, but if it's a comment then one or the other (or neither) of $1 and $2 will be a single whitespace char.
5. It was not in the rules of the challenge, but this could also be modified fairly easily to treat HTML tags surrounded by whitespace in the same manner that comments are. (Interwoven with text, with only one intermediate space remaining after a replacement.)
I hope the code comes out OK in your blog comments. If it doesn't, please let me know the best way to do it. (And consider adding preview! ;-)
-Todd ("Speednet")

Public Class Html
    Private Shared ReadOnly _Regex_HTML As New Regex("[=]\s*""[^""]*""|=\s*'[^']*'|[^<>])+(?(?=<)|>)|(\s)*]*>(?(1)\s*|(\s)*)", RegexOptions.Compiled Or RegexOptions.IgnoreCase)

    Public Shared Function StripHtml(ByVal html As String) As String
        Return _Regex_HTML.Replace(html, "$1$2")
    End Function
End Class
Speednet • November 11th, 2008
Ugh. It messed up my Regex Pattern.
I'll try again, with just the Regex definition:
Private Shared ReadOnly _Regex_HTML As New Regex("</?(?=[a-z])(?>[=]\s*""[^""]*""|=\s*'[^']*'|[^<>])+(?(?=<)|>)|(\s)*<![^>]*>(?(1)\s*|(\s)*)", RegexOptions.Compiled Or RegexOptions.IgnoreCase)
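To see the first conditional in isolation, here is a much reduced C# sketch. It is not the full pattern above: the class name, the `Strip` method, and the simplified tag body `[^<>]*` are assumptions made for illustration, and it ignores quoted attribute values and comments entirely.

```csharp
using System.Text.RegularExpressions;

public static class ConditionalSketch
{
    // Reduced pattern (assumption, not the full regex from this comment):
    //   </?[a-z]      start of an opening or closing tag
    //   [^<>]*        tag contents, stopping at the next '<' or '>'
    //   (?(?=<)|>)    if the next character is '<', consume nothing;
    //                 otherwise require and consume the closing '>'
    private static readonly Regex Tag =
        new Regex(@"</?[a-z][^<>]*(?(?=<)|>)", RegexOptions.IgnoreCase);

    public static string Strip(string html)
    {
        return Tag.Replace(html, string.Empty);
    }
}
```

With this, `ConditionalSketch.Strip("<a href=\"#\">Hello</a<br />World")` returns `HelloWorld`: the unterminated `</a` is removed without swallowing the `<` that starts the following `<br />`, which is then matched and removed as its own tag.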
Colin • November 11th, 2008
Won't you need to be careful about the code page that is being read in, and ultimately the code page the result will be written out as? Otherwise you could miss the UTF-7 XSS attacks.
configurator • November 11th, 2008
I wonder why the post's RSS title became (at least when viewing it in iGoogle):

Editor: Snipped
Steve Wagner • November 11th, 2008
You can also use the open source http://wiki.developer.mindtouch.com/SgmlReader. It is an HTML reader that exposes the API of an XmlReader, so you can create an XmlDocument from any HTML.
YaronD • November 11th, 2008
I just tried the one that rendered differently on FF for you, and my FF (3.0.3) rendered it just like IE and Google Chrome did for you.
Opera, BTW, also agrees.
haacked • November 12th, 2008
@configurator I noticed that iGoogle attempts to put a snippet of the blog post into the title attribute of the link so it shows up as a tooltip. Maybe they're running into similar problems I did when stripping HTML? ;)
David S • November 12th, 2008
You guys have way too much time on your hands. =P
Seriously though, this would make an awesome little library in codeplex for HTML Parsing.
pcdinh • November 12th, 2008
In PHP you can simply use strip_tags(). .NET is too complicated and deficient.
configurator • November 12th, 2008
Could be... Funny that the only time I see that bug is when speaking about how hard HTML stripping is :)
haacked • November 12th, 2008
@pcdinh I wonder if that method passes all these tests. Care to verify?
Greg • November 12th, 2008
Using sed with a decent regular expression would be much easier and much, much faster than .NET code.
I've done this extensively to get, reformat and extract data from HTML web pages for insertion into a SQL Server database. It's much easier and much less error prone than writing your own .NET code.
I used it to remove javascript and simplify formatting of html (e.g, replacing all <table ...=""> that have lots of options set with <table>).
sed faq: http://www.grymoire.com/Unix/Sed.html
configurator • November 12th, 2008
@Greg, how is using sed less error prone than running the regex in .NET?
Also, the title in iGoogle is now somehow fixed! :)
Adam • November 13th, 2008
@haacked: I found a port of strip_tags for c# and it passed 12/17 of the first tests you had displayed before adding the new ones.
I'm actually using the code in one of my sites and it seems to work fine, but then again I'm only dealing with stripping the code from a WYSIWYG editor.
Here it is for anyone interested.
public static string StripTags(this string str)
{
    return str.StripTags("");
}

public static string StripTags(this string str, string allowed_tags)
{
    string pattern_for_all_tags = "<[^>]+>";
    // pattern for allowed tags
    string allowed_patterns = "";
    if (allowed_tags != "")
    {
        // get allowed tags if any exist
        Regex r = new Regex("[\\/<> ]+");
        allowed_tags = r.Replace(allowed_tags, "");
        string[] allowed_tags_array = allowed_tags.Split(',');
        foreach (string s in allowed_tags_array)
        {
            if (s == "") continue;
            // Define patterns for the opening tag (with and without attributes) and the closing tag
            string p_1 = "<" + s + " [^><]*>$";
            string p_2 = "<" + s + ">";
            string p_3 = "</" + s + ">";
            if (allowed_patterns != "") allowed_patterns += "|";
            allowed_patterns += p_1 + "|" + p_2 + "|" + p_3;
        }
    }
    // Get all HTML tags included in the string
    Regex strip_tags = new Regex(pattern_for_all_tags);
    MatchCollection all_tags_matched = strip_tags.Matches(str);
    if (allowed_patterns != "")
        foreach (Match m in all_tags_matched)
        {
            Regex r_1 = new Regex(allowed_patterns);
            Match m_1 = r_1.Match(m.Value);
            if (!m_1.Success)
            {
                // if not allowed, replace it
                str = str.Replace(m.Value, "");
            }
        }
    else
        // if no tags are allowed, replace all
        str = strip_tags.Replace(str, "");
    return str;
}
Speednet • November 15th, 2008
@haacked: I was curious if you took a look at my solution. I had assumed when you said "challenge" that you would be interested in those posts where an actual programmed solution was presented. Pardon my utter lack of humility, but I think my solution was quite elegant in its conciseness and ability to flexibly handle each scenario, no?
Speednet • November 15th, 2008
@haacked: I found another situation that requires another rule above, and I also have a minor quibble with one of your rules.
First, the quibble. In your "WithCommentInterleavedWithText" test for removing comments, your test results show that if a comment is surrounded by a space on each side (as in that test), after the replacement only one of the spaces remains.
While it may appear that way on the page (two consecutive spaces appear as one space in the rendered page), in the DOM the two separate text nodes remain, one with a trailing space and the other with a leading space. Thus, Html.StripHtml("Hello <!--> World") should return "Hello--World" (dashes = spaces), not "Hello-World".
(As a caveat, Internet Explorer automatically normalizes the node list after removing any nodes, in this case resulting in the two text nodes merging into one, with a single separating space, but standards-based browsers like Firefox correctly keep them as two separate text nodes.)
The additional exception I found is that script tags are handled in a special manner in browsers, so they must be dealt with differently than the regular tags and comments.
Through testing, I have found that once a script tag begins, it will never stop consuming text until it finds a </script> tag. So if there is an HTML comment embedded within the script tag, as happens with most script tags, then the comment inside the script tag will be incorrectly stripped by any HTML stripper that does not treat script tags as a special case. (All the text after the first right-angle character (">") will be left in the string.)
Here is a test I concocted for script tags (I'm also displaying my VB bias):
<TestMethod()> _
Public Sub Html_ScriptWithEmbeddedRightAngle_ReturnsEmptyString()
    Dim s As String = "<script>//<![CDATA[" & vbCrLf & "alert('>');//]]></script>"
    Assert.AreEqual(String.Empty, Html.StripHtml(s))
End Sub
So, as a result of the two items above, I have updated my original code so that it (a) does not attempt to consolidate spaces surrounding stripped comments, and (b) handles the special case of script tags.
(Incidentally, I found out one other interesting tidbit: It seems that the only HTML tag that cannot be closed by starting a new tag before the last right-angle character is the ending </script> tag. i.e., <a href="#">Hello</a<br /> does display a link, but </script<br /> will not end a script.)
Public Class Html
    Private Shared ReadOnly _Regex_HTML As New Regex("<script(?=[\s<>])(?>[^<]|<(?!/script[\s>]))*</script(?>=\s*""[^""]*""|=\s*'[^']*'|[^<>])*(?(?=<)|>)|</?(?=[a-z])(?>[=]\s*""[^""]*""|=\s*'[^']*'|[^<>])+(?(?=<)|>)|<![^>]*>", RegexOptions.Compiled Or RegexOptions.IgnoreCase)

    Public Shared Function StripHtml(ByVal html As String) As String
        Return _Regex_HTML.Replace(html, "")
    End Function
End Class
-Todd
haacked • November 16th, 2008
@Speednet. If you view source, you'll see there are indeed two spaces. It's my HTML markup that's incorrect, not the actual test. ;)
Sorry about that!
Speednet • November 16th, 2008
Well, I guess I came to the same conclusion -- the hard way! ;-)
balang • November 16th, 2008
@speednet your second regex works...thankz
Steve C • November 20th, 2008
Anyone care to extend this regex to include support for excluding tags from being stripped?
public static string StripHtml(string html, params string[] exclusions)
...would be the signature. This would return the string with all tags stripped except those defined in the exclusions array.
Been trying to work on this one and it's giving me a bit of trouble.
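One possible shape for that overload, as a rough sketch only: it uses a deliberately naive tag pattern (an assumption, not the regexes posted earlier in this thread) that captures the tag name, and a match evaluator that keeps the tag when its name is in the exclusions list. The class name is made up, and comments, script blocks, and `>` inside attribute values are not handled.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HtmlExclusionSketch
{
    // Naive tag pattern (assumption): group 1 captures the tag name of an
    // opening or closing tag; everything up to the next '>' is the tag body.
    private static readonly Regex Tag =
        new Regex(@"</?([a-z][a-z0-9]*)[^<>]*>", RegexOptions.IgnoreCase);

    public static string StripHtml(string html, params string[] exclusions)
    {
        if (html == null) throw new ArgumentNullException("html");
        string[] keep = exclusions ?? new string[0];

        // Keep the matched tag verbatim if its name is excluded from
        // stripping; otherwise replace it with the empty string.
        return Tag.Replace(html, m =>
            keep.Contains(m.Groups[1].Value, StringComparer.OrdinalIgnoreCase)
                ? m.Value
                : string.Empty);
    }
}
```

With this, `HtmlExclusionSketch.StripHtml("<p>Hello <b>World</b></p>", "b")` returns `Hello <b>World</b>`. Swapping in a more robust tag pattern should only require that the tag name stays in a capture group.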
Doug • November 26th, 2008
Here is a regex (with the 80/20 rule in mind)
/<\/?[^>]+>/gi
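For comparison with the other solutions in this thread, here is the same pattern as a C# sketch (the class and method names are made up for illustration). `Regex.Replace` already replaces every match, which covers the `g` flag, and `RegexOptions.IgnoreCase` covers `i`.

```csharp
using System.Text.RegularExpressions;

public static class EightyTwentySketch
{
    // C# equivalent of /<\/?[^>]+>/gi
    public static string Strip(string html)
    {
        return Regex.Replace(html, @"</?[^>]+>", string.Empty, RegexOptions.IgnoreCase);
    }
}
```

It handles simple markup (`<p>Hello <b>World</b></p>` becomes `Hello World`), and the "20" shows up as soon as an attribute value contains a right-angle character: `<a title=">">Hello</a>` becomes `">Hello`, because the match stops at the first `>`.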