UPDATE: Mea culpa! Maurice pointed out that (except for the casing) my original expression WAS correct. I only needed the RegexOptions.Singleline option; I didn’t need to add (\s|\n) everywhere. Here’s the corrected post. Thanks Maurice!

Last time I talked about matching HTML with regular expressions, I published a regular expression with a couple small bugs. The first bug was not my fault, but rather the fault of the rich text editor that comes with .TEXT. It was being overly “helpful” when I tried to edit the post and uppercased some of the code. As you know, “\S” is much different than “\s” to a regular expression.
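
To show how much damage that uppercasing can do, here’s a minimal sketch that counts matches for each escape against a tiny tag. The class name CasingDemo and the helper CountMatches are made up for this illustration and are not part of the original expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CasingDemo
{
    // Hypothetical helper: counts how many matches a pattern finds in the input.
    public static int CountMatches(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string input = "<div id>";

        // \s matches one whitespace character: only the single space.
        Console.WriteLine(CountMatches(@"\s", input)); // 1

        // \S matches one NON-whitespace character: the other seven.
        Console.WriteLine(CountMatches(@"\S", input)); // 7
    }
}
```

Same letter, opposite meaning, so an editor that "fixes" the casing quietly inverts the pattern.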

The second bug is entirely my fault, and I write this as a confession and to provide a fix. You see, I assumed (and you know what happens when we assume) that complete tags tend to be on a single line. Well, that’s not always the case. You might encounter something ugly like this:

<div
    id = "blah" alt=" man
    this is ugly html "
    >fire this guy...
</div>

The expression I had posted wouldn’t have matched the div tag sitting in plain sight there, so I went in and corrected that sucker. The fix requires the RegexOptions.Singleline option so that the . character matches \n. Here’s the expression reprinted (with correct casing) for your reference.


The expression itself is unchanged apart from the casing: \s already matches \n, so the whitespace parts need no help. What does need help is the . character, which is why you have to use RegexOptions.Singleline. Here’s a code snippet that uses this expression to match the above html.

string html = "<div\n\tid=\"blah\" alt=\" man\n"
  + "\tthis is ugly html \"\n"
  + "\t>"
  + "fire this guy...\n"
  + "</div>";


Regex regex = new
MatchCollection matches = regex.Matches(html);

Console.WriteLine(".....Original Html......");
Console.WriteLine(html + Environment.NewLine);
Console.WriteLine(".....Each Tag......");

foreach(Match match in matches)
  Console.WriteLine("TAG: " + match.Value.Replace("\n", " "));

This produces the output:

.....Original Html......
<div    id="blah" alt=" man   this is ugly html "   >fire this guy...</div>
.....Each Tag......
TAG: <div     id="blah" alt=" man     this is ugly html "     >
TAG: </div>
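
If you want to see the effect of the option in isolation, here’s a rough sketch using a deliberately tiny pattern (a lazily matched double-quoted value), not the full expression from this post. The names SinglelineDemo, QuotedValue and Matches are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Illustrative pattern only (an assumption, not the post's expression):
    // an opening quote, as few characters as possible, then a closing quote.
    public const string QuotedValue = "\".*?\"";

    // Hypothetical helper: reports whether the pattern matches under the given options.
    public static bool Matches(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, QuotedValue, options);
    }

    public static void Main()
    {
        // An attribute value that spans two lines, like the ugly html above.
        string input = "alt=\" man\n\tthis is ugly html \"";

        // By default . stops at \n, so the closing quote is never reached.
        Console.WriteLine(Matches(input, RegexOptions.None));       // False

        // With Singleline, . also matches \n and the value is found.
        Console.WriteLine(Matches(input, RegexOptions.Singleline)); // True
    }
}
```

Note the exact spelling of the enum member: it is RegexOptions.Singleline, and since C# is case sensitive, SingleLine will not compile.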

So sorry about that. Hope this one treats you better.