Matching HTML With Regular Expressions Redux
UPDATE: Mea culpa! Maurice pointed out that (except for the casing) my original expression WAS correct. I only needed the RegexOptions.Singleline option. I didn’t need to add the (\s|\n) everywhere. Here’s a corrected post. Thanks Maurice!
Last time I talked about matching HTML with regular expressions, I published a regular expression with a couple small bugs. The first bug was not my fault, but rather the fault of the rich text editor that comes with .TEXT. It was being overly “helpful” when I tried to edit the post and uppercased some of the code. As you know, “\S” is much different than “\s” to a regular expression.
The second bug is entirely my fault, and I write this as a confession and to provide a fix. You see, I assumed (and you know what happens when we assume) that complete tags tend to be on a single line. Well, that’s not always the case. You might encounter something ugly like this:
<div
    id = "blah" alt=" man
    this is ugly html "
>fire this guy...
</div>
The expression I had posted wouldn’t have matched the div tag sitting in plain sight there all by itself, so I went in there and corrected that sucker. It requires using the RegexOptions.Singleline option so that the . character matches \n. Here’s the expression reprinted (with correct casing) for your reference.
</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
The whitespace matches (via \s) already included \n. The main difference is that . now matches \n too, so a quoted attribute value can span lines. In order for this to work, you need to use RegexOptions.Singleline. Here’s a code snippet that uses this expression to match the above html.
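using System;
using System.Text.RegularExpressions;

string html = "<div\n\tid=\"blah\" alt=\" man\n"
    + "\tthis is ugly html \"\n"
    + ">fire this guy...\n"
    + "</div>";
// Singleline lets . match \n, so quoted attribute values may span lines.
Regex regex = new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", RegexOptions.Singleline);
MatchCollection matches = regex.Matches(html);
Console.WriteLine(".....Original Html......");
Console.WriteLine(html + Environment.NewLine);
Console.WriteLine(".....Each Tag......");
foreach (Match match in matches)
    Console.WriteLine("TAG: " + match.Value.Replace("\n", " "));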
This produces the output:
.....Original Html......
<div id="blah" alt=" man this is ugly html " >fire this guy...</div>

.....Each Tag......
TAG: <div id="blah" alt=" man this is ugly html " >
TAG: </div>
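If you want to see for yourself what the option changes, here’s a minimal sketch that runs the same expression twice, once with RegexOptions.None and once with RegexOptions.Singleline, and counts the tags found. The sample string and the variable names (pattern, withoutOption, withOption) are made up for illustration; only the expression itself comes from this post.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input: one tag whose quoted alt value spans two lines.
string html = "<div id=\"blah\" alt=\" man\n\tthis is ugly html \">fire this guy...</div>";

// The tag-matching expression from this post.
string pattern = @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";

// Default options: . stops at \n, so the quoted alt value cannot be matched
// and the opening div tag is skipped.
int withoutOption = Regex.Matches(html, pattern, RegexOptions.None).Count;

// Singleline: . also matches \n, so the whole opening tag is matched.
int withOption = Regex.Matches(html, pattern, RegexOptions.Singleline).Count;

Console.WriteLine("Without Singleline: " + withoutOption + " tag(s)");
Console.WriteLine("With Singleline: " + withOption + " tag(s)");
```

On this input, the first count should be 1 (only the closing </div>) and the second should be 2. Don’t confuse this option with RegexOptions.Multiline, which changes what ^ and $ match and has no effect on the . character.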
So sorry about that. Hope this one treats you better.