UPDATE: Mea culpa! Maurice pointed out that (except for the casing) my original expression WAS correct. I only needed the RegexOptions.SingleLine option. I didn’t need to add the (\s|\n) everywhere. Here’s a corrected post. Thanks Maurice!

Last time I talked about matching HTML with regular expressions, I published a regular expression with a couple small bugs. The first bug was not my fault, but rather the fault of the rich text editor that comes with .TEXT. It was being overly “helpful” when I tried to edit the post and uppercased some of the code. As you know, “\S” is much different than “\s” to a regular expression.

The second bug is entirely my fault and I write this as a confession and to provide a fix. You see, I assumed (and you know what happens when we assume) that complete tags tend to be on a single line. Well that’s not the always the case. You might encounter something ugly like this:

<div     id = "blah" alt=" man    this is ugly html "     >     fire this guy... </div> 

The expression I had posted wouldn’t have matched the div tag sitting in plain sight there so I went in there and corrected that sucker all by itself. It requires using the RegexOptions.SingleLine option so that the . character matches \n. Here’s the expression reprinted (with correct casing) for your reference.

</?\w+((\s+\w+(\s\*=\s\*(?:".\*?"|'.\*?'|[\^'">\s]+))?)+\s\*|\s\*)/?>

The main difference is now I’m including \n anywhere I’m matching whitespace (via \s). In order for this to work, you need to use the RegexOption SingleLine. Here’s a code snippet that uses this expression to match the above html.

string html = "<div\n\tid=\"blah\" alt=\" man\n"
  + "\tthis is ugly html \"\n"
  + "\t>"
  + "fire this guy...\n"`
  + "</div>";

 

Regex regex = new
Regex(@"\</?\w+((\s+\w+(\s\*=\s\*(?:"".\*?""|'.\*?'|[\^'""\>\s]+))?)+\s\*|\s\*)/?\>",
RegexOptions.Singleline);
MatchCollection matches = regex.Matches(html);

Console.WriteLine(".....Original Html......");
Console.WriteLine(html + Environment.NewLine);
Console.WriteLine(".....Each Tag......");

foreach(Match match in matches)
{
  Console.WriteLine("TAG: " + match.Value.Replace("\n", " "));
}

This produces the output:

>     .....Original Html......<div    id="blah" alt=" man   this is ugly html "   >fire this guy...</div>.....Each Tag......TAG: <div     id="blah" alt=" man     this is ugly html "     >TAG: </div>

So sorry about that. Hope this one treats you better.