Matching HTML With Regular Expressions Redux

UPDATE: Mea culpa! Maurice pointed out that (except for the casing) my original expression WAS correct. I only needed the RegexOptions.Singleline option. I didn't need to add the (\s|\n) everywhere. Here's a corrected post. Thanks Maurice!

Last time I talked about matching HTML with regular expressions, I published a regular expression with a couple small bugs. The first bug was not my fault, but rather the fault of the rich text editor that comes with .TEXT. It was being overly “helpful” when I tried to edit the post and uppercased some of the code. As you know, “\S” is much different than “\s” to a regular expression.

The second bug is entirely my fault and I write this as a confession and to provide a fix. You see, I assumed (and you know what happens when we assume) that complete tags tend to be on a single line. Well, that's not always the case. You might encounter something ugly like this:

<div     id = "blah" alt=" man
this is ugly html "
> fire this guy... </div>

The expression I had posted wouldn't have matched the div tag sitting in plain sight there, so I went back and fixed it. The fix is to use the RegexOptions.Singleline option so that the . character matches \n. Here's the expression reprinted (with correct casing) for your reference.

</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>

Apart from the casing, the expression itself is unchanged: \s already matches \n, so no extra alternation is needed for whitespace. What matters is the RegexOptions.Singleline option, which lets the . inside the quoted attribute values match \n. Here's a code snippet that uses this expression to match the above html.

using System;
using System.Text.RegularExpressions;

// Sample markup: a single tag spread over several lines.
string html = "<div\n\tid=\"blah\" alt=\" man\n"
    + "\tthis is ugly html \"\n"
    + "\t>"
    + "fire this guy...\n"
    + "</div>";

// Singleline lets the dot match \n inside the quoted attribute values.
Regex regex = new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", RegexOptions.Singleline);

MatchCollection matches = regex.Matches(html);
Console.WriteLine(".....Original Html......");
Console.WriteLine(html + Environment.NewLine);
Console.WriteLine(".....Each Tag......");
foreach (Match match in matches)
{
    // Flatten newlines so each tag prints on one line.
    Console.WriteLine("TAG: " + match.Value.Replace("\n", " "));
}

This produces the output:

.....Original Html......
<div
id="blah" alt=" man
this is ugly html "
>fire this guy...
</div>

.....Each Tag......
TAG: <div id="blah" alt=" man this is ugly html " >
TAG: </div>
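
To see what the option actually buys you, here is a small sketch that runs the same expression over the same markup with and without it. The class and method names (SinglelineComparison, CountTags) are made up for this illustration and are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineComparison
{
    // The tag-matching expression from the post, unchanged.
    public const string TagPattern =
        @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";

    // The same sample markup as the snippet above, written as one literal.
    public const string Html =
        "<div\n\tid=\"blah\" alt=\" man\n\tthis is ugly html \"\n\t>fire this guy...\n</div>";

    // Counts how many tags the expression finds under the given options.
    public static int CountTags(RegexOptions options)
    {
        return Regex.Matches(Html, TagPattern, options).Count;
    }
}
```

CountTags(RegexOptions.None) finds only the closing </div>, because the quoted alt value contains a newline that the dot refuses to cross. CountTags(RegexOptions.Singleline) finds both tags.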

So sorry about that. Hope this one treats you better.

What others have said

Maurice Apr 30, 2005 10:11 AM
# re: Matching HTML With Regular Expressions Redux
I'm not quite sure why you had to add \n as an alternation value. If you set the Singleline option, newlines are picked up appropriately without having to further complicate the expression itself. Or am I misunderstanding your fix here?
Haacked May 01, 2005 6:00 PM
# re: Matching HTML With Regular Expressions Redux
I'm adding it as an alternative value for \s (whitespace). The Singleline option changes the meaning of the . character so it includes \n, but doesn't change the meaning of the \s character as far as I know and according to MSDN.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemtextregularexpressionsregexoptionsclasstopic.asp
Maurice May 02, 2005 12:35 AM
# re: Matching HTML With Regular Expressions Redux
My point exactly... Your original expression works quite well once you turn on single-line mode. It was failing not on the whitespace but on the dot.

The additional alternation just adds complexity without buying you anything. Why? The \s syntax is nothing more than a shortcut for a character class containing \n. Thus, you're effectively repeating yourself.

Try the expression "x\s*y" (no quotes) against

x

y

Even without single-line mode on, you will get a match. On the other hand, a dot would 'fail' without it.
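
The same experiment can be sketched in C#. The class and method names below are hypothetical, chosen only for this illustration; Regex.IsMatch and RegexOptions.Singleline are the real .NET members.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceVersusDot
{
    // "x", a blank line, then "y", as in the example above.
    public const string Input = "x\n\ny";

    // \s is a character class that already includes \n, so no option is needed.
    public static bool WhitespaceMatches()
    {
        return Regex.IsMatch(Input, @"x\s*y");
    }

    // The dot skips \n unless RegexOptions.Singleline is passed in.
    public static bool DotMatches(RegexOptions options)
    {
        return Regex.IsMatch(Input, "x.*y", options);
    }
}
```

WhitespaceMatches() is true with no options at all, DotMatches(RegexOptions.None) is false, and DotMatches(RegexOptions.Singleline) is true.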

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpconCharacterClasses.asp

Simpler expressions make everyone happier... :)
Haacked May 02, 2005 10:03 AM
# re: Matching HTML With Regular Expressions Redux
Good point Maurice! I mistakenly thought \s didn't include \n. I've corrected the post! Thanks!
Dan Jan 21, 2007 2:03 PM
# re: Matching HTML With Regular Expressions Redux
Awesome. Thanks for this. It helped me out.
sandeep Apr 04, 2007 9:12 AM
# re: Matching HTML With Regular Expressions Redux
How will this work if any of the HTML attributes has a blank space in it? For example, I have an img tag where the src path may contain a blank space.
Haacked Apr 04, 2007 9:45 AM
# re: Matching HTML With Regular Expressions Redux
Well as long as you quote the src part, it should work fine. For example:

<img src="there is a space.html" />

Will be matched just fine.

<img src=there is a space.html />

Will assume src="there" because there's no way to know whether "is" is the next attribute or part of the src value.
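
Here is a rough way to check both cases. This is only a sketch: MatchesWholeTag is a made-up helper name, and the unquoted sample drops the ".html" suffix on purpose. As far as I can tell, with the suffix the expression finds no tag at all, because "." is not a \w character and so cannot be part of an attribute name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeSpaces
{
    // The tag-matching expression from the post, unchanged.
    public const string TagPattern =
        @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";

    // True when the expression's first match covers the entire input string.
    public static bool MatchesWholeTag(string html)
    {
        Match match = Regex.Match(html, TagPattern, RegexOptions.Singleline);
        return match.Success && match.Value == html;
    }
}
```

The quoted tag <img src="there is a space.html" /> matches as a whole. So does <img src=there is a space />, but only because "is", "a" and "space" are read as three separate attributes with no values.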
Kevin Deldycke Apr 11, 2007 12:16 PM
# re: Matching HTML With Regular Expressions Redux
By the way, if you're looking for the PHP equivalent, here it is: http://kev.coolcavemen.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/
james taylor Aug 21, 2007 3:50 AM
# re: Matching HTML With Regular Expressions Redux
How can I change this to match a specific tag?
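
One possible approach, sketched under assumptions: replace the leading \w+ with the tag name you want. The ForTag helper and its tagName parameter are hypothetical, and adding RegexOptions.IgnoreCase is my own choice here (HTML tag names are case-insensitive), not something the post's expression does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpecificTag
{
    // Builds the post's expression with a fixed tag name in place of \w+.
    // tagName is a hypothetical parameter, for example "div".
    public static Regex ForTag(string tagName)
    {
        string pattern = "</?" + Regex.Escape(tagName)
            + @"((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";
        return new Regex(pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```

SpecificTag.ForTag("div") matches <div id="a"> and </div> but not <span> or <divider>, since whatever follows the tag name has to be whitespace, "/" or ">".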
nripin babu Sep 19, 2007 2:54 AM
# re: Matching HTML With Regular Expressions Redux
How could I get the values of id and value from a tag like
<INPUT id=Text1 value="dfgdfgdfgdfg ergdfg">
I want to use regular expressions because these are to be read dynamically :(

asp.net 2.0

Please help.
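
A minimal sketch of one way to do it, assuming you already have the tag text in hand. The attribute pattern below is a hypothetical helper derived from the attribute part of the post's expression, with named groups added. It relies on .NET allowing the same group name to appear in several alternatives.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeReader
{
    // Hypothetical helper pattern (not from the post): one name=value pair,
    // capturing the value whether it is double-quoted, single-quoted or bare.
    private static readonly Regex Attribute = new Regex(
        @"(?<name>\w+)\s*=\s*(?:""(?<value>.*?)""|'(?<value>.*?)'|(?<value>[^'"">\s]+))",
        RegexOptions.Singleline);

    // Returns attribute names (compared case-insensitively) mapped to their values.
    public static Dictionary<string, string> Read(string tag)
    {
        var attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in Attribute.Matches(tag))
        {
            attributes[m.Groups["name"].Value] = m.Groups["value"].Value;
        }
        return attributes;
    }
}
```

For the tag above, Read gives "Text1" for id and "dfgdfgdfgdfg ergdfg" for value. Attributes without a value are skipped in this sketch.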
Andrew Oct 14, 2007 11:19 PM
# re: Matching HTML With Regular Expressions Redux
Hi, is there a way to get matches for <a href> tags only? Or just to strip out the <a *> and </a> tags?
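
For the stripping part, something along these lines might do. It is a sketch only: the AnchorTags class and Strip method are invented for this example, the tag name in the post's expression is fixed to "a", and RegexOptions.IgnoreCase is my own addition.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorTags
{
    // The post's expression with the tag name fixed to "a".
    private static readonly Regex Anchor = new Regex(
        @"</?a((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes opening and closing anchor tags, leaving the link text in place.
    public static string Strip(string html)
    {
        return Anchor.Replace(html, string.Empty);
    }
}
```

Given See <a href="http://example.com/">this page</a> and <abbr>etc</abbr>. the call returns See this page and <abbr>etc</abbr>. with the abbr tags left alone.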
