UPDATE: I added three new unit tests and one interesting case in which the three browsers each render something differently. Well, I’m back at it, but this time I want to strip all HTML from a string. Specifically: Remove all HTML opening and self-closing tags, so <foo> and <foo /> should both be stripped. Remove all HTML closing tags such as </p>. Remove all HTML comments. Do not strip any text in between tags that would be rendered by the browser. This may...
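To make those requirements concrete, here is a minimal sketch in Python. It uses the standard library's `html.parser` instead of a regular expression, so it is a swapped-in technique and not the approach from the post itself. The names `TagStripper` and `strip_html` are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags and comments."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Opening, closing, and
        # self-closing tags, as well as comments, go to other handlers
        # that do nothing by default, so they are simply discarded.
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of markup with tags and comments removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return "".join(parser.parts)


print(strip_html("<p>Hello <foo />world<!-- hidden --></p>"))
```

Two caveats with this sketch: character references such as `&amp;` are decoded in the output, and the contents of `<script>` and `<style>` elements are treated as ordinary text, even though a browser would not render them.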
A while ago I wrote a blog post about how painful it is to properly parse an email address. This post is kind of like that, except that this time, I take on HTML. I’ve written about parsing HTML with a regular expression in the past and pointed out that it’s extremely tricky and probably not a good idea to use regular expressions in this case. In this post, I want to strip out HTML comments. Why? I had some code that uses a regular expression to strip comments from HTML, but found one of those feared...