Matching HTML With Regular Expressions Redux
UPDATE: Mea culpa! Maurice pointed out
that (except for the casing) my original expression WAS correct. I only
needed the RegexOptions.SingleLine
option. I didn’t need to add the
(\s|\n)
everywhere. Here’s a corrected post. Thanks Maurice!
Last time I talked about matching HTML with regular expressions, I published a regular expression with a couple small bugs. The first bug was not my fault, but rather the fault of the rich text editor that comes with .TEXT. It was being overly “helpful” when I tried to edit the post and uppercased some of the code. As you know, “\S” is much different than “\s” to a regular expression.
The second bug is entirely my fault and I write this as a confession and to provide a fix. You see, I assumed (and you know what happens when we assume) that complete tags tend to be on a single line. Well that’s not the always the case. You might encounter something ugly like this:
<div id = "blah" alt=" man this is ugly html " > fire this guy... </div>
The expression I had posted wouldn’t have matched the div tag sitting in
plain sight there so I went in there and corrected that sucker all
by itself. It requires using the RegexOptions.SingleLine option so that
the . character matches \n. Here’s the expression reprinted (with
correct casing) for your reference.
</?\w+((\s+\w+(\s\*=\s\*(?:".\*?"|'.\*?'|[\^'">\s]+))?)+\s\*|\s\*)/?>
The main difference is now I’m including \n anywhere I’m matching
whitespace (via \s). In order for this to work, you need to use the
RegexOption SingleLine. Here’s a code snippet that uses this
expression to match the above html.
string html = "<div\n\tid=\"blah\" alt=\" man\n"
+ "\tthis is ugly html \"\n"
+ "\t>"
+ "fire this guy...\n"`
+ "</div>";
Regex regex = new
Regex(@"\</?\w+((\s+\w+(\s\*=\s\*(?:"".\*?""|'.\*?'|[\^'""\>\s]+))?)+\s\*|\s\*)/?\>",
RegexOptions.Singleline);
MatchCollection matches = regex.Matches(html);
Console.WriteLine(".....Original Html......");
Console.WriteLine(html + Environment.NewLine);
Console.WriteLine(".....Each Tag......");
foreach(Match match in matches)
{
Console.WriteLine("TAG: " + match.Value.Replace("\n", " "));
}
This produces the output:
> .....Original Html......<div id="blah" alt=" man this is ugly html " >fire this guy...</div>.....Each Tag......TAG: <div id="blah" alt=" man this is ugly html " >TAG: </div>
So sorry about that. Hope this one treats you better.
Comments
0 responses