Using a Regular Expression to Match HTML

UPDATE: The expression originally posted here contained a big mistake. Unfortunately .TEXT (trying to be helpful) munged the code I posted and uppercased some characters. I’m now using Firefox to post so that I don’t get the helpful text editor. The original also didn’t take multi-line HTML tags into account. That’s been corrected now, but you’ll have to use the RegexOptions.Singleline option for it to work.

I just love regular expressions. I mean look at the sample below.

</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>

What’s not to like?

Ok I admit, I was a bit intimidated by regular expressions when I first started off as a developer. All I needed was a Substring method and an IndexOf method and I was set. But after a few projects that required some intense text processing, I realized the power and utility of regular expressions. They should be on the tool belt of every developer. To that end, I recommend Mastering Regular Expressions by Jeffrey Friedl. This is really THE book on Regular Expressions. Reading it will make your Regex-Fu powerful.

So let’s look at a common task of matching HTML tags within the body of some text. When you initially think to parse an HTML tag, it seems quite easy. You might consider the following expression:

</?\w+\s+[^>]*>

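Roughly translated, this expression looks for the opening angle bracket (optionally followed by a slash) and the tag name, followed by some whitespace, then any run of characters other than the greater-than symbol, and finally the greater-than symbol that ends the tag.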

Now this will probably work 99 times out of 100, but there’s a flaw in this expression. Do you see it? What if I asked you to match the following tag?

<img title="displays >" src="big.gif">

Hopefully you see the issue here. The expression will match

<img title="displays >

Unfortunately, this implementation is too naive. We have to consider the fact that the greater-than symbol does not end a tag if it’s within a quoted attribute value. Thus we must correctly match attributes.
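To see the failure concretely, here is a minimal JavaScript sketch that runs the naive expression against the troublesome tag. JavaScript stands in for whichever regex engine you use, and the variable names are only illustrative.

```javascript
// The naive tag pattern from above: "<", optional "/", a tag name,
// whitespace, anything that is not ">", then ">".
const naiveTag = /<\/?\w+\s+[^>]*>/;

const html = '<img title="displays >" src="big.gif">';
const match = html.match(naiveTag);

// The match stops at the ">" inside the quoted title attribute.
console.log(match[0]); // <img title="displays >
```

The src attribute never makes it into the match, which is exactly the problem described above.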

Now there are four possible formats for an HTML attribute:

name="double quoted value"
name='single quoted value'
name=notquotedvaluewithnowhitespace
name

Each of these cases is quite simple. In the first case, you could do the following:

\w+\s*=\s*"[^"]*"

The portion "[^"]*" matches a double quote, followed by any non double quote characters, followed by a double quote. Another way to express this is to use a lazy quantifier, like so:

\w+\s*=\s*".*?"

The portion ".*?" uses lazy evaluation (the "lazy star") to match as few characters as possible. For example, if we had a string like so

<A name="test" value="test2">

evaluating ".*" (aka greedy) would match

"test" value="test2"

However, the lazy version consumes the fewest characters that still match the expression, so the first match using ".*?" would be "test" and the second match would be "test2".
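The difference is easy to check. The following JavaScript sketch assumes a tag in which both attribute values are double quoted, and collects every match of the greedy and the lazy pattern:

```javascript
const tag = '<A name="test" value="test2">';

// Greedy: ".*" runs to the last double quote it can reach.
const greedy = tag.match(/".*"/g);
// Lazy: ".*?" stops at the first closing double quote.
const lazy = tag.match(/".*?"/g);

console.log(greedy); // [ '"test" value="test2"' ]
console.log(lazy);   // [ '"test"', '"test2"' ]
```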

The full expression for matching an HTML tag is that lovely mash of characters presented at the very beginning of this post. It’s a modified version of the one presented in Friedl’s book.
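As a rough check, here is the full expression transcribed into JavaScript. This is a sketch under a few assumptions: the "s" (dotAll) flag, available in ES2018 engines, stands in for .NET's RegexOptions.Singleline, the forward slashes are escaped only because JavaScript regex literals require it, and the sample markup is invented for illustration.

```javascript
// The full expression from the top of the post. The "s" flag lets "." match
// newlines, so a quoted attribute value may span several lines.
const htmlTag = /<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gs;

const sample = [
  '<p>Some <b>bold</b> text and an image:',
  '<img title="displays >" src="big.gif">',
  '<a href=page.html',
  '   title="spans',
  '   two lines">link</a></p>',
].join('\n');

// Collect every tag; the text between the tags is ignored.
const tags = sample.match(htmlTag);
console.log(tags.length); // 7
console.log(tags[3]);     // <img title="displays >" src="big.gif">
```

The img tag is now matched in full, and the anchor tag is matched even though its title value contains a line break.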

However I wouldn’t recommend you just plunk that down in your code. Rather, you should consider adding it to a regular expression library assembly.

Don’t know how? Well I’ll show you a code listing for an exe that when run, builds a fully compiled version of this regular expression into an assembly that you can then reference in any project. In a later installment, I’ll explain in more detail just what the code is doing and how to use the compiled assembly. How irresponsible of me not to do that now. ;)

Source Listing


What others have said

Dimitri Glazkov Oct 26, 2004 5:25 AM
# re: Using a Regular Expression to Match HTML
Look at you go, man. That's good stuff. After all of that partisan politics junk, it's refreshing to see a good tech post :)
Haacked Oct 26, 2004 9:26 AM
# re: Using a Regular Expression to Match HTML
Ha ha ha... Thanks. At this point I think the two undecided people in this world will have to make up their own minds and not have me tell them what to think (though I think I would do a fine job of that).

You should hopefully see some more good techie posts coming up.
Pat Oct 26, 2004 10:01 AM
# re: Using a Regular Expression to Match HTML
Just to play devil's advocate for a minute, it seems like HTML is just too darned close to XML to have to parse this way. Isn't there a library out there for converting HTML into XHTML? If you can do that, you can just read the file in using XmlDocument::LoadXml(). Once you've done that, you can find your tags using an XPath query. Sorry, I just couldn't let a parsing post go by without tossing in my two cents ;)
Haacked Oct 26, 2004 10:05 AM
# re: Using a Regular Expression to Match HTML
Pat, you're absolutely right. There is an SGML library as well as the HTML agility pack (http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx).

However, we (RSS Bandit) found that SGML was too heavyweight, poor performing, and even a bit buggy for the simple task of searching HTML for links to RSS Feeds.

I was tasked with replacing SGML with regular expressions and it performs quite well.
you've been HAACKED Oct 26, 2004 10:31 AM
# Why Not Convert HTML to XML?
Jon Galloway Oct 26, 2004 12:13 PM
# re: Using a Regular Expression to Match HTML
Great info. I've never really understood the lazy / greedy match thing before. Thanks.

Using regexes on nested HTML gets more difficult. We messed with balancing groups, which are supposed to handle nested constructs, but we never got them working. Every sample (MSDN, Dan Appleman's book, etc.) was the same nested-parentheses example, and none of them worked for HTML.

Did you handle nesting in your HTML parsing?
Haacked Oct 26, 2004 1:43 PM
# re: Using a Regular Expression to Match HTML
Funny you mention this. I was just talking about this with a friend and on a newsgroup.

Regular expressions just aren't well suited for nested matching. Balancing groups are a Microsoft addition to regular expressions, so they're not something I've played around with much.

Since the info I needed was inside a tag, my regular expression works fine for that type of processing. You could also use it to strip all tags from a document.

If I were using it to actually parse an HTML doc (which I have some code that does), I would keep track of indices: every time I match a tag, I record its beginning and end index. Then I compare those with the indices of the previously matched tag and grab the content in between.
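The index bookkeeping described here can be sketched in a few lines of JavaScript. This is not the code referred to in the comment, just an illustration of the idea; the function name textBetweenTags and the simplified tag pattern (negated character classes instead of lazy dots) are assumptions.

```javascript
// Simplified tag pattern: quoted values use negated classes, so no flag
// beyond "g" is needed.
const tagPattern = /<\/?\w+(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^'">\s]+))?)*\s*\/?>/g;

// Returns the text found between consecutive tags, skipping empty gaps.
function textBetweenTags(html) {
  const pieces = [];
  let lastEnd = 0; // index just past the previously matched tag
  for (const match of html.matchAll(tagPattern)) {
    if (match.index > lastEnd) {
      pieces.push(html.slice(lastEnd, match.index));
    }
    lastEnd = match.index + match[0].length;
  }
  if (lastEnd < html.length) {
    pieces.push(html.slice(lastEnd)); // trailing text after the last tag
  }
  return pieces;
}

console.log(textBetweenTags('<p>Hello <b>big</b> world</p>'));
// [ 'Hello ', 'big', ' world' ]
```

Joining the returned pieces gives the document with all tags stripped, which is the other use mentioned above.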
Simon Mourier Oct 27, 2004 11:26 PM
# re: Using a Regular Expression to Match HTML
What about comments (<!-- blablah -->)? Your regex matches links in comments too, right? Same remark for <script> and <style>?

Simon (Just trying to be annoying. You know me :-)
Haacked Oct 28, 2004 9:30 AM
# re: Using a Regular Expression to Match HTML
Oh! You're eeeevil! You are absolutely right. It would match tags within comments.

It's easy enough to strip out comments before parsing.
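A minimal JavaScript sketch of that pre-processing step follows. The helper name stripComments and the simple link pattern are assumptions made for the example, not part of the original code.

```javascript
// Remove HTML comments first; [\s\S] matches any character, newlines included.
function stripComments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, '');
}

const page = 'before <!-- <a href="hidden.html">x</a> --> <a href="shown.html">y</a>';

// Only the link outside the comment survives the first pass.
const links = stripComments(page).match(/<a\s[^>]*>/g);
console.log(links); // [ '<a href="shown.html">' ]
```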
Haacked Nov 05, 2004 10:28 AM
# re: Using a Regular Expression to Match HTML
I found a mistake. For some reason my blogging engine capitalized some characters. Also, if a tag is on multiple lines, the expression above is broken. Here's my updated one.

</?\w+((\s+\w+(\s*=\s*(?:"(.|\n)*?"|'(.|\n)*?'|[^'">\s]+))?)+\s*|\s*)/?>
tester Mar 02, 2005 12:51 PM
# test
<iframe src="www.mic.com"></iframe>some text
Ash Mar 05, 2005 2:59 AM
# re: Using a Regular Expression to Match HTML
How can I use a regex to match an HTML comment?

eg: <!-- blah blah blah -->

thanks
Jim MacDiarmid Apr 04, 2005 4:34 PM
# re: Using a Regular Expression to Match HTML
Hi,
I'm working on a template parsing engine and I was wondering how I would go about using Regex to capture text(html) between custom tags?

Jim
jim.macdiarmid@comcast.net
Haacked Apr 04, 2005 6:05 PM
# re: Using a Regular Expression to Match HTML
Hi Jim,

That's actually quite challenging. You could attempt to use .NET's balanced matching mechanism, but it's pretty difficult to grasp and it's not standard regex.

It really depends what you're trying to accomplish. One way I've done it is to use a regex that matches html tags and then strip all tags out.

Hope that helps.
Adrian Apr 14, 2005 9:30 AM
# re: Using a Regular Expression to Match HTML
Can anyone help me out? I'm trying to build a regular expression to validate an SQL full text query. So something like:

"cable television" and "wireless bluetooth" OR "3G"

to ensure all Boolean operators are outside quotes, and all text is in quotes.

Is this possible with a regular expression? or the wrong approach?

Cheers!
adrian
Haacked Apr 14, 2005 9:48 AM
# re: Using a Regular Expression to Match HTML
I'm not too familiar with SQL full text query syntax. How do you escape a double quote within the text?

The naive approach is something like:

"[^"]*?"(\s*(and|or)\s*"[^"]*?")*

This is basically matching an expression like:

"some text"

or

"some text" and "some more text"

or

"some text" and "some more" or "even more"

and so on.

This expression doesn't handle escaped double quotes within the text.
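For validating a whole query rather than finding a match somewhere inside it, the expression can be anchored. The JavaScript sketch below makes two assumptions that are not in the expression as posted: the ^ and $ anchors, and the case-insensitive flag so that "OR" and "or" are treated alike. The function name isValidQuery is illustrative.

```javascript
// The naive expression from above, anchored so the whole input must match.
const fullTextQuery = /^"[^"]*?"(\s*(and|or)\s*"[^"]*?")*$/i;

function isValidQuery(query) {
  return fullTextQuery.test(query.trim());
}

console.log(isValidQuery('"cable television" and "wireless bluetooth" OR "3G"')); // true
console.log(isValidQuery('"cable television" and wireless'));                      // false
console.log(isValidQuery('cable and "3G"'));                                       // false
```

As noted, a query containing an escaped double quote inside the text would still be rejected.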
Don Apr 22, 2005 10:59 AM
# re: Using a Regular Expression to Match HTML
How can I get rid of commas and parentheses inside a tag?
eg.

<p(,)> this is the function calculate() </p(,)>


after replace it should look like:

<p> this is the function calculate() </p>
Haacked Apr 22, 2005 11:37 AM
# re: Using a Regular Expression to Match HTML
Are commas allowed within the attributes of a tag? For example:

<p title="This has a comma, right here.">

Should the regex strip that comma as well? If so it's much easier than if not.

Phil
Don Apr 22, 2005 1:08 PM
# re: Using a Regular Expression to Match HTML
Yes, if there is a comma between < and > (or between < and />), it should be removed.

thanks

Right now I'm using [(),], which gets rid of them even in the tag's value.

<p(,)> this is the function, calculate() </p(,)>


after replace it should look like:

<p> this is the function, calculate() </p>
Haacked Apr 22, 2005 1:26 PM
# re: Using a Regular Expression to Match HTML
Just to be clear, in my example above, you'd want the comma removed from the title attribute as well, right?

If so, I'd just use the html expression to match tags
</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>

Once you have the tag, replace , with empty space.

Much easier to take a two-step approach on this one.
you've been HAACKED Apr 22, 2005 3:14 PM
# Matching HTML With Regular Expressions Redux
Haacked Apr 22, 2005 3:18 PM
# re: Using a Regular Expression to Match HTML
Ah, I see the problem. Commas aren't allowed in valid HTML, so my expression won't match it. However, I have a code snippet that will. I haven't tested it super thoroughly, but it worked on some samples I threw at it. It leaves commas in the text between tags alone, but note that it removes every comma inside a matched tag, including those within attribute values.

// Requires: using System.Text; using System.Text.RegularExpressions;
public string RemoveCommas(string html)
{
    // Tag pattern that also tolerates commas wherever whitespace is allowed.
    Regex regex = new Regex(@"</?\w+(((\s|\n|,)+\w+((\s|\n|,)*=(\s|\n|,)*(?:"".*?""|'.*?'|[^'"">\s]+))?)+(\s|\n|,)*|(\s|\n|,)*)/?>", RegexOptions.Singleline);

    int lastIndex = 0;
    StringBuilder result = new StringBuilder();
    MatchCollection matches = regex.Matches(html);
    foreach (Match match in matches)
    {
        // Copy the text before this tag unchanged.
        result.Append(html.Substring(lastIndex, match.Index - lastIndex));
        // Copy the tag itself with its commas removed.
        result.Append(match.Value.Replace(",", ""));
        lastIndex = match.Index + match.Value.Length;
    }
    // Copy whatever follows the last tag.
    result.Append(html.Substring(lastIndex));
    return result.ToString();
}
Don Apr 22, 2005 3:45 PM
# re: Using a Regular Expression to Match HTML
I like the function method. Although it didn't replace any commas.

But let me tell you what I really want. I'm reading an XML file line by line, and some XML tags have commas.
So I want to delete those commas, but only in the tag names. If a comma is in a value, e.g. <something>bla bla, bla</something>,
it should be left intact.

Here's the code I have right now:

string theThingToReplaceWith = "";
Regex exp = new Regex(@"[,]");

try
{
    while (myStreamReader.Peek() != -1)
    {
        theLine = myStreamReader.ReadLine();
        theLine = exp.Replace(theLine, theThingToReplaceWith);
        myArrayList.Add(theLine);
    }
}
catch (EndOfStreamException eose)
{
    Response.Write(eose.ToString());
}
Help Me May 21, 2005 4:53 AM
# re: Using a Regular Expression to Match HTML
How do I replace the non-alphanumeric characters (other than spaces) between tags with a "-"?
Eg:
<Tag>Aapple's Great % Fruit</Tag>
<Tag>Aapple's Bad # Fruit</Tag>
should become
<Tag>Aapple-s Great - Fruit</Tag>
<Tag>Aapple-s Bad - Fruit</Tag>
Mark Fletcher May 26, 2005 12:41 PM
# re: Using a Regular Expression to Match HTML
Hi,

Good article! I was wondering if anyone can recommend a strategy for parsing HTML for a web spider? I want it to be able to match links, and find links in JavaScript. So far I've been using one expression to extract the links. However, with JavaScript code embedded in a file, I think I'd be better off running two passes:

1) Grab the links in tags
2) Make a pass for anything in javascript tags

What do you think?
Robin Jun 05, 2005 5:02 AM
# re: Using a Regular Expression to Match HTML
tried this using php and preg_replace, comes up with error:

Warning: Unknown modifier '\' in /var/www/html.php on line 20

line 19: $reg = '</?\W+((\S+\W+(\S*=\S*(?:".*?"|'."'".'.*?'."'".'|[^'."'".'">\s]+))?)+\s*|\s*)/?>';

line 20: preg_match_all($reg, $html, $matches);
haacked Jun 05, 2005 11:35 AM
# re: Using a Regular Expression to Match HTML
Hi Robin, this syntax is particular to .NET, so I'm not sure if you have to make some modifications for PHP.

Also, the expression you're trying is not correct. The corrected expression is at the following URL (http://haacked.com/archive/0001/01/01/2784.aspx).
Reddy Jun 10, 2005 1:21 PM
# re: Using a Regular Expression to Match HTML
Please let me know the regular expression to detect input that consists only of one or more white-space characters. My requirement is that I should raise an error message if somebody enters only white space in the Order name text box.
Pasan Jun 12, 2005 9:15 AM
# re: Using a Regular Expression to Match HTML
This article is good.
How can we use a regular expression to read text within specific HTML tags?
Also, if we want to read the text within the first paragraph tags of a web page, how can we use a regular expression to do that?
Vadym May 12, 2006 10:00 AM
# re: Using a Regular Expression to Match HTML
Brilliant! But how about case like this:
<td> Some weird <text> goes here </td> ? Expression above will treat <text> like tag and replace with "". To prevent this from happening I loop through the string and get anything that is not a tag but looks like tag (I have to match <text> to collection of valid tags). Save all occurrences in temp variables (something like #0001=’<text>’, #002=’<text 2>’ etc…). After I run regexp above to strip all HTML tags. And then loop again and replace all temp variables with corresponding values. Pain a bit ;-) So my output looks like:
Some weird <text> goes here. Any better ideas?
Haacked May 12, 2006 10:12 AM
# re: Using a Regular Expression to Match HTML
The problem is since <text> is not escaped, it probably won't get rendered by the browser, so stripping it shouldn't be a problem.

Ideally, you wouldn't allow that sort of thing. You'd HTML encode it so it looks like &lt;text&gt;
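For illustration, encoding comes down to a handful of replacements. The helper below is a hypothetical JavaScript sketch; the name htmlEncode is an assumption, and a real application would normally rely on its platform's built-in encoder (HttpUtility.HtmlEncode in .NET, for instance) rather than a hand-rolled one.

```javascript
// Hypothetical helper: replace the characters that are significant in HTML
// with their entity equivalents. "&" must be handled first so that the
// entities produced by the later steps are not encoded a second time.
function htmlEncode(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

console.log(htmlEncode('Some weird <text> goes here'));
// Some weird &lt;text&gt; goes here
```

Once the text is encoded this way, a tag-matching expression no longer mistakes it for markup.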
Kjeks Jun 01, 2006 5:40 AM
# re: Using a Regular Expression to Match HTML
Since the question came up: HTML comments are effectively matched with the regexp

<!--.*?-->

.*? provides non-greedy matching. Use single line option for good karma :-)
rubendj Jun 06, 2006 2:57 AM
# re: Using a Regular Expression to Match HTML
<!--.*?--> doesn't work in cases like:

<!--comments1--> visible text <!--comments2-->

because it matches the entire line.

This reg. exp. works better: <!--[^(-->)]*-->
Kjeks Jun 15, 2006 2:09 AM
# re: Using a Regular Expression to Match HTML
Actually, <!--.*?--> works very well for the case you're describing.

The '?' in '.*?' makes the match non-greedy, meaning that as few characters as possible will be matched. This works in Perl and Python, at least. Other languages may have another syntax.
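The same behaviour can be observed in JavaScript, which uses the same lazy syntax. The sketch below compares the lazy pattern, a greedy variant, and the character-class pattern suggested earlier in the thread; the second test string is made up for the example.

```javascript
const line = '<!--comments1--> visible text <!--comments2-->';

// Lazy: each comment is matched separately.
const lazyComments = line.match(/<!--.*?-->/g);
// Greedy: a single match that swallows the visible text as well.
const greedyComments = line.match(/<!--.*-->/g);

console.log(lazyComments);   // [ '<!--comments1-->', '<!--comments2-->' ]
console.log(greedyComments); // [ '<!--comments1--> visible text <!--comments2-->' ]

// A character class excludes individual characters, not the sequence "-->".
// Here it rejects "(", ")", ",", "-" and ">" among others, so a comment
// containing a comma is not matched at all.
const classMatch = '<!-- a, b -->'.match(/<!--[^(-->)]*-->/);
console.log(classMatch); // null
```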
Freddy Jul 07, 2006 2:09 AM
# re: Using a Regular Expression to Match HTML

Here is a regex that will accept form input, as long as it doesn't contain any HTML or JSP comment tags:
^(?:(?!(!|%)--[\s\S]*?--[ %\t\n\r]*>).)*$

It works on the client-side (JavaScript RegEx engine) but fails on the server side with an "Invalid RegEx" error.

Does anyone know how to write the expression so that it will validate against the XML specification for RegEx?

Thanks,
---Freddy

More details follow:

I need a regex that will reject form data that contains HTML or JSP comments.

I am constrained to entering the pattern into a proprietary XML editor.

I don't have the option of using substitution or NOT operators - only the expression itself.

Page 198 in the Perl Cookbook provides the basic format for a NOT regex:
^(?:(?!PATTERN_NOT_TO_MATCH).)*$

And the RegExLib provided the HTML comment pattern that I want to dis-allow:
<!--[\s\S]*?--[ \t\n\r]*>

My SAX parser won't allow a less-than character in the expression, so I will leave it out.

I also want to dis-allow JSP comments, so I added a % as an alternate to the ! at the beginning and as a member of the character set at the end.
^(?:(?!(!|%)--[\s\S]*?--[ %\t\n\r]*>).)*$

It works fine on the client-side, but fails on the server-side.

How do I write an equivalent regex that will validate against the XML spec?

Shawn Aug 09, 2006 1:53 AM
# re: Using a Regular Expression to Match HTML
^(?:(?!(!|%)[\s\S]*?--[ %\t\n\r]*>).)*$

*should* also find the CDATA used in XML to escape HTML characters. This is untested, but it builds off

<![\s\S]*?--[ \t\n\r]*>

which is what I use (fully tested on thousands of pages, but I'm not looking for JSP comments)
John Sep 10, 2006 11:07 AM
# re: Using a Regular Expression to Match HTML
You are a RegEx genius!

Is it possible to enhance the expression to get a reference to the inner html if present, something like (?<innerHtml>)

For example, for the value:
<title>My Title</title>

I could retrieve the value "My Title"

Thanks!
JMARIN Oct 25, 2006 8:57 AM
# re: Using a Regular Expression to Match HTML
Hello!!
Does it work with ASP.NET tags like:
<asp:Textbox>
santhosh Nov 20, 2006 12:26 AM
# re: Using a Regular Expression to Match HTML
I want to write a regular expression for commenting out all the script tags in a page that are not already inside a comment block.

For ex:
1)

some text here.....
<script type="text/javascript">
alert("Hi santhosh");
alert("Bye....");
</script>
some text here......


I should comment the script section above. The output should be


some text here.....
<!--script type="text/javascript">
alert("Hi santhosh");
alert("Bye....");
</script-->
some text here......

2) But I should not comment out the script in the following scenario, where that portion of the code is already inside a comment.


some text here.....
<!--
some text here.....
<script type="text/javascript">
alert("Hi santhosh");

alert("Bye....");
</script>
some text here......
-->
some text here......


I have written the regex for commenting all the script tags.
The following is the regex i used

<script(.*?)>((.|\n)*?)(<\/script>)

Please help me.....
Haacked Nov 20, 2006 9:19 AM
# re: Using a Regular Expression to Match HTML
Two options I can think of.

1. Before processing, remove commented out scripts. For example: Replace:

"<!--<script>" with "<script>"

and </script>--> with "</script>"

And then run your original replacement.

OR

2. Use negative lookahead and negative lookbehind to make sure the script you're replacing isn't already commented out.
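A different technique from the two options above, sketched here in JavaScript, is to match comments and scripts with one alternation and rewrite only the scripts. It also covers the case where other text sits between the comment markers and the script. The names commentOrScript and commentOutScripts are assumptions for the example.

```javascript
// Match either a whole comment or a whole script element. Because comments
// are tried first, a script that sits inside a comment is consumed as part
// of that comment and never seen on its own.
const commentOrScript = /<!--[\s\S]*?-->|<script\b[\s\S]*?<\/script>/gi;

function commentOutScripts(html) {
  return html.replace(commentOrScript, (m) =>
    m.startsWith('<!--')
      ? m                               // already a comment: leave it alone
      : '<!--' + m.slice(1, -1) + '-->' // drop the outer "<" and ">" and wrap
  );
}

const input = 'a <script>alert(1);</script> b <!-- <script>alert(2);</script> --> c';
console.log(commentOutScripts(input));
// a <!--script>alert(1);</script--> b <!-- <script>alert(2);</script> --> c
```

One caveat of this sketch: a script whose body contains "--" would produce a malformed comment, so the input needs checking before this is used on real pages.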
Milenko Curcin Dec 08, 2006 7:31 AM
# re: Using a Regular Expression to Match HTML
I'm trying to make a good regexp for finding HTML tags so that I can colour them (a program for colouring code), and there is one problem with this regexp: it will match <br> inside <pre>, as in this example:

<pre>some text<br>some more text</pre>

and this shouldn't happen; <br> is not a tag anymore, it is text. My knowledge of regexps is not great and I don't know how to fix this :(
Ashly Dec 20, 2006 3:46 AM
# re: Using a Regular Expression to Match HTML
Hi,

I am working with PHP 5.0
I have a string like this:


[code]
function getSize($userIDArr)
{
$arrSize = sizeof($userIDArr);
}

[code]
function getUserNames($userIDArr)
{
for($i=0; $i < sizeof($userIDArr); ++$i)
{
$userNamesArr[$i] = $userIDArr[$i];
}
return $userNamesArr;
}
[/code]

[/code]



I need to replace the outer tag [code] [/code] pair with <abc> </abc>


The result should look like:


<abc>
function getSize($userIDArr)
{
$arrSize = sizeof($userIDArr);
}

[code]
function getUserNames($userIDArr)
{
for($i=0; $i < sizeof($userIDArr); ++$i)
{
$userNamesArr[$i] = $userIDArr[$i];
}
return $userNamesArr;
}
[/code]

</abc>


If anyone has any idea, please help me.

Thanks in advance

Ashly
Ashly Dec 20, 2006 3:52 AM
# re: Using a Regular Expression to Match HTML
Hi

My email id is:

meetashly@yahoo.com

Thanks
Ashly
SteveLionbird Dec 29, 2006 7:36 AM
# re: Using a Regular Expression to Match HTML
Kjeks..

.*? provides non-greedy matching. Use single line option for good karma :-)


Excellent tip .. I'm a CF developer and could not find that documented. CF does not support lookbehinds so I was pulling my hair out trying to grab content between matching opening and closing tags accurately.

Thanks,
Steve
Nokturnal Jan 17, 2007 10:40 AM
# re: Using a Regular Expression to Match HTML
Can anyone port this over to work within javascript?

<\/?(?!strong|b|i|em)\w+((\s+\w+(\s*=\s*(?:"(.|\n)*?"|'(.|\n)*?'|[^'">\s]+))?)+\s*|\s*)\/?>

Cheers and thanks for the great tutorial here!
Nokturnal Jan 17, 2007 10:51 AM
# re: Using a Regular Expression to Match HTML
Oh man, forget what I typed above. The error was in the actual implementation of the javascript itself.

Cheers and sorry to waste your time :)
lb Apr 11, 2007 7:26 PM
# re: Using a Regular Expression to Match HTML
>You’ll have to use the SingleLine RegexOption for it to work

i had this same issue with a regex yesterday... how counter-intuitive is it that you use the singleline option to match a pattern that spans multiple lines?

i guess their thinking is that the singleline option means "treat this input as a single line" -- but that's not how we think when trying to get a regex to work. We think more along the lines of 'i want my pattern to match, even when it spans multiple-lines'
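The naming confusion is easier to untangle with a small experiment. The JavaScript sketch below assumes an ES2018 engine, where the "s" flag corresponds to .NET's Singleline option and the "m" flag to its Multiline option.

```javascript
const text = 'first line\nsecond line';

// Without any flag, "." stops at the newline.
const plainDot = /first.*second/.test(text);
// "s" (Singleline in .NET): "." also matches "\n".
const dotAll = /first.*second/s.test(text);
// "m" (Multiline) only changes ^ and $ so they match at line boundaries.
const plainAnchor = /^second/.test(text);
const multilineAnchor = /^second/m.test(text);

console.log(plainDot, dotAll);              // false true
console.log(plainAnchor, multilineAnchor);  // false true
```

So the option that lets a pattern span lines is the one about the dot, and the one named after multiple lines only affects the anchors.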

you know i wish i'd found this blog post yesterday... i had many of the same issues: the need for non-greedy matches, the need for the single line option... but my regex-fu is weak. i got there in the end.

lb
Robert Apr 20, 2007 4:58 AM
# re: Using a Regular Expression to Match HTML
I am versed in regexs so here is an idea I am testing for php. It matches doctype, open, close, and comment tags. It isn't perfect but it gets closer to html rules. This fixes the bug where a tag could be </div /> and is multi-line compatible.

preg version:

(
<\!\w+(?:\s+[^>]*?)+\s*>|
<\w+(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^"'>\s]+))?)*\s*/?>|
</\w+\s*>|
<\!--[^-]*-->
)


php version:

$pattern = "/(".
"<\!\w+(?:\s+[^>]*?)+\s*>|".
"<\w+(?:\s+\w+(?:\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\"'>\s]+))?)*\s*\/?>|".
"<\/\w+\s*>|".
"<\!--[^-]*-->".
")/";
@preg_match_all($pattern, $filedata, $matchesArray);
var_dump($matchesArray);
Chris Apr 22, 2007 1:20 PM
# re: Using a Regular Expression to Match HTML
Hi Folks,
I'm trying to get plain text out of "any" HTML document, which seems to be quite difficult because most of the regexes I tried don't handle scripts, styles and so on. Any suggestions for doing it a different way? I'm programming in C# .NET.

Thanks in advance and sorry for my poor english
Conrad de Wet Apr 24, 2007 11:53 AM
# re: Using a Regular Expression to Match HTML
Hi,
Been searching around to find Javascript that will convert <TAG attrib=xyz> into <TAG attrib="xyz">

I would imagine the code you have developed is close; I'm just not sure how to use it.

Reason: Using a CMS for editing and creating HTML with the output as XML to flash. The textField.htmlText does not support unquoted attributes.

Any assistance appreciated.
Thanks
Haacked Apr 24, 2007 6:27 PM
# re: Using a Regular Expression to Match HTML
The easiest way to do this is to download the Subtext Source Code and do a search for the method ConvertHtmlToXHtml.

You'll need to add a reference to SgmlReaderDll.dll which is included in the source.
anon May 04, 2007 12:39 PM
# re: Using a Regular Expression to Match HTML
Great regex, but just to note it doesn't work with attributes that do not have quotes:

[div onclick=alert("hacked") ]

I've been messing with the regex as it is and there doesn't seem to be an easy way to support all three styles.

any ideas?
Carlos May 28, 2007 5:48 PM
# re: Using a Regular Expression to Match HTML
How can I retrieve only the text that is between the tags <title> and </title>?
thanks
NeoGeo Jun 05, 2007 5:41 AM
# re: Using a Regular Expression to Match HTML
Can someone help me? I would like to search for script tags. I would like to have it check for either the opening tag or the closing tag, or both. I just need to check whether the string has a script tag.

Thank you in advance.
Shail Jun 15, 2007 9:08 AM
# re: Using a Regular Expression to Match HTML
I am not strong in regular expressions. I am stuck while validating one string. Actually I want test or test or test or any string. Can you help me write a regex for this validation? Thanks.
Brad Jul 09, 2007 2:09 PM
# re: Using a Regular Expression to Match HTML
I don't see any use in this regex. I'm trying to do something more advanced that I haven't seen done yet.

I'm matching HTML tags too, but the whole element, not just the start tag: I want the start and the end tag. I haven't been able to figure out how to do the recursive part, only a fixed number of levels, like 10.

Meaning I want to match the whole outer tag of..
<div id="a"><div><div></div></div></div>

So I search for the tag with id="a" and it returns the whole contents, including the nested tags. I'm pretty much done; I just can't figure out the infinitely nested duplicate tags. GLGL.
Sagar Jul 11, 2007 1:50 PM
# re: Using a Regular Expression to Match HTML
I am trying to super-optimize an HTML output (getting rid of runs of 2 or more spaces in the response) but failed to construct a suitable regex. Can anybody help?
Luke-Jr Sep 06, 2007 3:56 PM
# Invalid HTML
I realize I'm a bit late seeing this post, but....
<img title="displays >" src="big.gif">
happens to be INVALID HTML in the first place. > is not allowed even in attributes!
The correct code, which does NOT break your original regex is:
<img title="displays &gt;" src="big.gif">
cioman Oct 11, 2007 5:14 AM
# re: Using a Regular Expression to Match HTML
Sorry! Re-posting, because the previous post didn't show up well:
---


</?[a-z][a-z0-9]*[^<>]*>

can strip all tags from an HTML file.

Can we prevent it from searching the following tags:

a, i, b, p, sup and their closing tag equivalents (i have removed <> as the tags weren't displaying when I posted the query earlier)

The application of such a regex would be to preserve all the tags that make a structure of the text, with minimal formatting.

What would be better if one could replace 'i' with 'em' and 'b' with 'strong' (replace '' with <> please, I was having problems posting this query to the forum).

Any ideas??

Thanks, in advance.
Jesse Morrow Oct 15, 2007 11:11 PM
# Using a Regular Expression to Validate Balanced HTML Tags
I allow the user to submit customized header and footer HTML through a web form. While I'm not too concerned with what they do with the HTML or even if it contains valid tags it is important that their tags be well formed and balanced (i.e. properly matched with ending tags) so that their potentially bad HTML doesn't cause the rest of the web site to get trampled.

I made a Javascript function which takes an HTML snippet as a string and returns true if the HTML is well formed and all tags are properly balanced.

I took the regular expression given here and extended it to match:

1) any opening tag, its *text* content, and its corresponding closing tag,
2) any self-closing tag - such as <br />, <input />,
3) any HTML comments, or
4) pure text (i.e. no HTML tags)

The trick is taking into account the nested nature of HTML which regular expressions aren't expressive enough to match. My trick is to iteratively replace each matched portion with nothing - thus stripping it from the HTML string until it has been stripped down to an empty string. If the loop can strip the HTML iteratively to an empty string then it must be valid and all tags balanced. If the loop hits a point where nothing new is being matched and stripped and yet the string is still not empty then the HTML is invalid or unbalanced.

The reason this works is because each loop iteration strips off all the most deeply nested elements which have no child elements (leaf elements) thus leaving their parent elements as leaf elements for the next iteration.

Here is the code:

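// Alternatives, in order: an element with text-only content, a self-closing tag,
// a comment, or pure text. The closing tag must repeat the opening name (\1).
var regex = /[^<>]*<(\w+)(?:(?:\s+\w+(?:\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>[^<>]*<\/\1\s*>[^<>]*|[^<>]*<\w+(?:(?:\s+\w+(?:\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/>[^<>]*|<!--.*?-->|^[^<>]+$/ig;

function validate(html) {
    var v = html;
    // Keep stripping leaf elements until nothing more can be removed.
    do {
        html = v;
        v = html.replace(regex, '');
    } while (v != html);
    // Balanced input reduces to the empty string.
    return v.length == 0;
}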

The loop structure and iterative concept is totally stable. The only thing which might need some refinement is the regular expression as I haven't thought too deeply about all the comment and line return possibilities.

Jesse
Jason Oct 27, 2007 12:02 AM
# re: Using a Regular Expression to Match HTML
This is in regards to this comment.

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" id="something">
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

Those tags don't get matched because of the non \w characters ":" and "-".

It should be changed to...

$pattern = "/(".
"<\!\w+(?:\s+[^>]*?)+\s*>|".
"<\w+(?:\s+\w+([^>]*)(?:\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\"'>\s]+))?)*\s*\/?>|".
"<\/\w+\s*>|".
"<\!--[^-]*-->".
")/i";
Alex Nov 06, 2007 11:53 AM
# re: Using a Regular Expression to Match HTML
First off, thanks for the original regular expression, </?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>. It took me seconds, maybe a minute to find what I needed doing a google search. It took me much longer to get this working in my application using JavaScript. Since it was a PITA for me, I am going to show how I did it, because I would have been thankful to find this.

function ContainsHtml(inputText)
{
    var htmlRegex = new RegExp(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gim);

    if (inputText.match(htmlRegex))
    {
        return true;
    }
    return false;
}

The difficult part was setting this string </?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?> in JavaScript with quotation marks and forward and backslashes. I started off by surrounding the initial regex string with quotation marks, then escaping the appropriate characters, but this did not work.
var re = "</?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)/?>";

After some time, I gave up and went with this approach, which worked.
new RegExp(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gim);

Basically, I was escaping the wrong characters. Hope this helps, and thanks again for the regex Phil.
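
The string attempt in the comment above fails because a JavaScript string literal consumes single backslashes before the regex engine sees them, so \w arrives as a plain w. A hedged sketch of the two working forms follows; the variable names and the sample text are assumptions for illustration.

```javascript
// Regex literal: only the forward slashes need escaping.
const fromLiteral = /<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gim;

// String form: every backslash is doubled and the double quotes are escaped.
// Forward slashes need no escaping here.
const fromString = new RegExp(
  "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>",
  "gim"
);

// What goes wrong with single backslashes: the string parser drops them.
const broken = new RegExp("<\w+>");
console.log(broken.source); // <w+>

const sampleText = 'Click <a href="x.html" title="a > b">here</a> now';
console.log(sampleText.match(fromLiteral)); // [ '<a href="x.html" title="a > b">', '</a>' ]
console.log(sampleText.match(fromString));  // same two matches
```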
sums Nov 16, 2007 7:51 AM
# re: Using a Regular Expression to Match HTML
Hi,
I was reading through your posts.
Can you please help me with this problem:(Which is pretty simple but I do not know regular expressions)

I need to define a rule where anypage.aspx converts to anypage (to define in web.config)

ex: www.xyz.com/page1.aspx or www.xyz.com/page2.aspx etc. becomes
www.xyz.com/page1 and www.xyz.com/page2 respectively.


Looking forward to hearing from you soon.
Thanks..
Jinesh Shah Feb 22, 2008 1:49 AM
# re: Using a Regular Expression to Match HTML
Hi all, I found a nice script here, but I still have a problem. I have a string like this:

string htmlstring = "<form name=\"f1\"> <input type=\"text\" name=\"t1\"><input type=\"text\" name=\"t2\"><input type=\"submit\" name=\"submit\"></form>";

Using a regular expression, I want the value of the name attribute and nothing else. Can anybody give me that regular expression?

The answer must be: t1 t2
and not f1 or submit.
I have tried lots of expressions but still can't find the right one.

Thank you,
Jinesh
mj Apr 24, 2008 1:36 PM
# re: Using a Regular Expression to Match HTML
Hi, I am doing an assignment and need help!
I am parsing HTML using C# and .NET and need help finding img, object, and applet tags using regular expressions. Please help!
Mic May 01, 2008 4:18 PM
# re: Using a Regular Expression to Match HTML
Hi

I am using preg_split to split words...and then trying to match using
preg_grep, how do i match html tags?

Thanks
tim May 11, 2008 7:18 PM
# re: Using a Regular Expression to Match HTML
I'm lazy and had been using

<(/)?(b|i|u|sup|sub|small|big)([^>]*)>

to match tags. It works pretty well for me since I don't care what's in the tag itself like attributes, but fails on self closing tags. This is a problem for another day.

I thought that I could match everything except the list by adding a "not" operator (^) inside those brackets:

<(/)?(^b|i|u|sup|sub|small|big)([^>]*)>

but it doesn't work. Somebody has made off with my regular expression reference book and the "tutorials" I'm reading online are leaving me more confused than if I just brute force it myself.

Any assistance would be appreciated
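
One way to match every tag except those in a list is a negative lookahead, since ^ only negates inside a character class such as [^>]. This is a sketch under the assumption that the regex engine supports lookahead (JavaScript is used here); otherTags and the sample string are illustrative names.

```javascript
// (?!...) is a negative lookahead: it rejects tag names found in the list.
// The \b keeps "b" from also excluding longer names such as "br" or "body".
const otherTags = /<\/?(?!(?:b|i|u|sup|sub|small|big)\b)\w+[^>]*>/gi;

const markup = "<p>Some <b>bold</b> and <i>italic</i> text<br></p>";
const found = markup.match(otherTags);
console.log(found); // [ '<p>', '<br>', '</p>' ]
```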
Carros DF May 20, 2008 12:10 PM
# re: Using a Regular Expression to Match HTML
<!--[\s\S]*?--[ \t\n\r]*> works nicely in Dreamweaver CS for removing comments via the Search and Replace box.
Alexander Thorell Jun 17, 2008 1:39 AM
# re: Using a Regular Expression to Match HTML
Hi and thanks Haack, but your expression \s]+))?)+\s*|\s*)/?> does not seem to match tags with attribute names containing a hyphen (-), like
Alexander Thorell Jun 17, 2008 4:21 AM
# re: Using a Regular Expression to Match HTML
Sorry, the expression got truncated, but I'm referring to the updated expression in your earlier comment. See if it works now...
\s]+))?)+\s*|\s*)/?>
Frank Dase Jun 27, 2008 2:31 AM
# re: Using a Regular Expression to Match HTML
I need an expression for a searchengine to highlight the word I searched for. But I have to ignore matches inside HTML tags.

for example:

http://www.test.com/laser">this is a laser test...

I want only to match "laser" outside the base tag.

I'm a noop in reg expression, so I hope you can help me. I need it for classic ASP.
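
A common way to skip text inside tags is to match the tags as well and leave them untouched. The sketch below is in JavaScript rather than classic ASP, and the function name highlight, the <b> wrapper, and the sample link are assumptions. It uses the naive [^>]* tag pattern, so a > inside a quoted attribute value would still trip it up.

```javascript
// Matches either a whole tag or the search term; tags are passed through as-is.
// Assumes the term contains no regex metacharacters.
function highlight(html, term) {
  const pattern = new RegExp("<[^>]*>|" + term, "gi");
  return html.replace(pattern, (match) =>
    match.startsWith("<") ? match : "<b>" + match + "</b>"
  );
}

const highlighted = highlight(
  '<a href="http://www.test.com/laser">this is a laser test</a>',
  "laser"
);
console.log(highlighted);
// <a href="http://www.test.com/laser">this is a <b>laser</b> test</a>
```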
wow.. your expression is good Jul 02, 2008 2:17 AM
# re: Using a Regular Expression to Match HTML
Thank you~^^
moo Jul 02, 2008 10:08 AM
# re: Using a Regular Expression to Match HTML
While RegExs are good for a variety of purposes, people really need to think about using the HTML DOM for most of their processing needs. If you're just trying to get elements from inside tags or information, using the DOM tends to be easier and more extensible.
Skyetech Jul 22, 2008 8:24 AM
# re: Using a Regular Expression to Match HTML
Frank D,
I'm trying to do exactly what you're doing with the search engine results. Did you get a solution to your problem?
Jamey Taylor Jul 22, 2008 11:26 AM
# re: Using a Regular Expression to Match HTML
Found a case it doesn't handle:
It doesn't match if the attribute name has a non-word character. For example, the hyphen in the first attribute below causes it to not match:



This has been a huge help to me in maintaining hundreds of non-xhtml compliant pages. Thanks!!!
yuree Jul 23, 2008 12:00 AM
# re: Using a Regular Expression to Match HTML
I want to strip non digits and first digit, any one can help?
I use "/[^\d]/g" to remove non digits, but I want to remove first digit too, e.g.
23,185

by using "/[^\d]/g", i see 823185, but i want to trim first digit.

I just want to display "23185", any clue?
Craig Laparo Aug 03, 2008 4:25 PM
# re: Using a Regular Expression to Match HTML
I'm a student in the Google Summer of Code program, working for Dojo (http://dojotoolkit.org) and I came across your awesome regular expression for catching HTML tags. I'd like to use it in my code, if it's ok with you. Do you have a CLA?
tp Sep 22, 2008 8:44 AM
# re: Using a Regular Expression to Match HTML

Hi Friends,

EDITOR: Comment Removed because of HTML formatting issues. Sorry.

Volkan Vardar Oct 09, 2008 2:02 AM
# re: Using a Regular Expression to Match HTML
<.*?>
should do the same...
Pencheff Oct 10, 2008 6:06 AM
# re: Using a Regular Expression to Match HTML (Delphi)
For those also working with Delphi and TRegExpr library, here's my Regex for parsing HTML tags (it's based on Phil's one):

RegexParser.Expression := '(?i)<(/?\w+)((\s+(\w+)(\s*=\s*("(.*?)"|[^''">\s]+))?)+\s*|\s*)/?>';

It does return the tag (b, /b, font, /font, etc) in Match[1]
the parameter (size, color, etc) in Match[4] and
the parameter value in Match[7]
smith Jan 10, 2009 4:30 PM
# re: Using a Regular Expression to Match HTML
http://www.cyworld.com/colap/2360326

(<([\/@!?#]?[^\W_]+)(?:\s|(?:\s(?:[^'">\s]|'[^']*'|"[^"]*")*))*>)|(<\!--[^-]*-->)

is better good.
Joe Bob Jan 10, 2009 6:42 PM
# re: Using a Regular Expression to Match HTML
HTML is not regular and hence can't be reliably parsed by regular expressions. Please see the following site for more information:

http://htmlparsing.icenine.ca
robert Mar 04, 2009 12:11 AM
# re: Using a Regular Expression to Match HTML
i need one regular expression that it show words using code html
Jamp Mark Mar 05, 2009 11:10 PM
# RegExp to Match Anchor Tag HREF URL
Here is a regex pattern to capture the URL in anchor tag HREF.

/<a\s+href="([^>]+?)"/
Mo Mar 09, 2009 7:05 AM
# re: Using a Regular Expression to Match HTML
Hi,

I need to match double quotes within a string, when they are not within HTML tags. What is the pattern to match "up" but not "highlight"?

Hello <span class="highlight">Peter</span>, what's "up"?

I'm using a VB-RegEx-Engine...
Andrei Mar 10, 2009 2:55 PM
# re: Using a Regular Expression to Match HTML
I am trying to match and replace links in anchor tags. It works with /<a\s+href="([^>]+?)"/ but i need to replace only what is in the href, the url, not all the anchor element..
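
To change only the URL, the parts around it can be captured and written back unchanged. A minimal sketch, assuming href is the first attribute and is double-quoted as in the pattern above; rewriteLinks and the example hosts are hypothetical.

```javascript
// Group 1 is everything up to the URL, group 2 the URL, group 3 the closing quote.
const hrefPattern = /(<a\s+href=")([^">]+)(")/gi;

// mapUrl is a caller-supplied function that returns the new URL.
function rewriteLinks(html, mapUrl) {
  return html.replace(hrefPattern, (match, before, url, after) =>
    before + mapUrl(url) + after
  );
}

const anchorInput = '<a href="http://old.example/page">link</a>';
const anchorOutput = rewriteLinks(anchorInput, (url) =>
  url.replace("old.example", "new.example")
);
console.log(anchorOutput); // <a href="http://new.example/page">link</a>
```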
Vaibhav Mar 12, 2009 12:02 AM
# re: Using a Regular Expression to Match HTML
i need a regular expression for removing self closing html tag
get entertainment Mar 18, 2009 12:34 PM
# re: Using a Regular Expression to Match HTML
How do you get the inner text using regex? what if there is several matching tags i.e. <ul><li>content</li><li>content</li>etc. </ul>
Black Mar 27, 2009 10:43 PM
# re: Using a Regular Expression to Match HTML
I want to get the content between these tags, but I don't know how to do it with a regular expression.

<helloworld>
thifkdslaj
fdskjaflksdj
fasjlk&8*()*$)#(
$*())(#
</helloworld>

So, how can I get the content between the helloworld tags with a regular expression? There are a lot of line breaks between them,
and the text I have is too big and long, so I can't use a replace function.
Thanks
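
In JavaScript, one way to let a match run across line breaks is [\s\S], which matches any character including newlines. A small sketch with made-up sample text:

```javascript
// The sample text is an assumption; the lazy [\s\S]*? stops at the first closing tag.
const block = "<helloworld>\nline one\nline two &8*()\n</helloworld>";
const result = /<helloworld>([\s\S]*?)<\/helloworld>/i.exec(block);

// Group 1 holds everything between the two tags, line breaks included.
const inner = result ? result[1] : null;
console.log(JSON.stringify(inner)); // "\nline one\nline two &8*()\n"
```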
Steve Tattersall Mar 22, 2010 3:36 AM
# re: Using a Regular Expression to Match HTML
I am wanting to remove lines of html code beginning with the start tag <DOCTYPE ...> to the end tag being the </table> how can I best achieve removing multiple lines html code?

Pete B Mar 22, 2010 4:51 AM
# re: Using a Regular Expression to Match HTML
Excellent.

Just spent the last hour trying to do this and failing:

/<\/?[a-z0-9-_]+(?:''|[^"'>]*(["']?)[\S\s]*?\1)\/?>/gi


Wize Mar 28, 2010 12:16 AM
# re: Using a Regular Expression to replace text outside of double quotes
I am trying to write a regular expression that will replace text but not inside of double quotes; and if there aren't any quotes, then it replaces the text

Example: abc "abc" abc
Replace with: def
Expression: (?!")abc(?!")

Result wanted: def "abc" def
Result got: def "abc" def

However, if I have: def "z abc z" def
I get: def "z def z" def
Instead of: def "z abc z" def
using: (?!")abc(?!")

I tried: (?!"\w*)abc(?!\w*")
but got: def "z def z" def

I am hoping to write an expression that will change:
abc abc (to) def def
abc "abc" abc (to) def "abc" def
abc "z abc z" abc (to) def "z abc z" def
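
Lookaheads on the neighbouring character cannot tell whether a word sits inside a quoted run. An alternative sketch is to match whole quoted runs first and pass them through unchanged; it assumes quotes are balanced and never escaped, and replaceOutsideQuotes is an illustrative name.

```javascript
// Alternation order matters: a quoted run is consumed whole before "abc" is tried.
const quotedOrWord = /"[^"]*"|abc/g;

function replaceOutsideQuotes(text, replacement) {
  return text.replace(quotedOrWord, (match) =>
    match.startsWith('"') ? match : replacement
  );
}

console.log(replaceOutsideQuotes("abc abc", "def"));           // def def
console.log(replaceOutsideQuotes('abc "abc" abc', "def"));     // def "abc" def
console.log(replaceOutsideQuotes('abc "z abc z" abc', "def")); // def "z abc z" def
```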
Tass Apr 08, 2010 5:13 PM
# re: Using a Regular Expression to Match HTML
Hi
I have some HTML in which I need to replace width:0;height:0;" with width:0px;height:0px;" in the style attribute.

Does anyone have an idea what the regex and the replacement could be?


<div class="cool"><img src="http://www.coolchaser.com/images/banner_xray.gif" alt="CoolChaser"></div><img style="visibility:hidden;width:0;height:0;" border=0 width=0 height=0 src="counters.gigya.com/.../bHQ9MTIxNjg3NDUyNjQyMSZwdD*xMjE2ODc*NjA*MDYyJnA9MjEwNjkxJmQ9Jm49bXlzcGFjZSZnPTE=.jpg" />
Anon Jul 09, 2010 2:16 PM
# re: Using a Regular Expression to Match HTML
stackoverflow.com/.../1732454#1732454
Flavio Troja Jul 15, 2010 3:04 AM
# re: Using a Regular Expression to Match HTML
I need a regular expression that catch html code between the tags <div class="result">
and <br clear="all">

look like this:

<div class="result">
.... (I want to catch this)
<br clear="all">


can you help me?
celso Jul 21, 2010 5:15 PM
# re: Using a Regular Expression to Match HTML
How can I replace all occurrences of myTerm outside of a <tag>, like

myTerm is <x='z myTerm w'> like myTerm

to

XXXXX is <x='z myTerm w'> like XXXXX

?
