Using a Regular Expression to Match HTML

Oct 25, 2004 regex html

I just love regular expressions. I mean look at the sample below.

</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>

What’s not to like?

Ok I admit, I was a bit intimidated by regular expressions when I first started off as a developer. All I needed was a Substring method and an IndexOf method and I was set. But after a few projects that required some intense text processing, I realized the power and utility of regular expressions. They should be on the tool belt of every developer. To that end, I recommend Mastering Regular Expressions by Jeffrey Friedl. This is really THE book on Regular Expressions. Reading it will make your Regex-Fu powerful.

So let’s look at a common task of matching HTML tags within the body of some text. When you initially think to parse an HTML tag, it seems quite easy. You might consider the following expression:

</?\w+\s+[^>]*>

Roughly translated, this expression looks for the opening bracket and tag name, followed by some whitespace and then any run of characters that doesn’t end the tag, up to the closing bracket.

Now this will probably work 99 times out of 100, but there’s a flaw in this expression. Do you see it? What if I asked you to match the following tag?

<img title="displays >" src="big.gif">

Hopefully you see the issue here. The expression will match

<img title="displays >

Unfortunately, this implementation is too naive. We have to consider the fact that the greater-than symbol does not end a tag if it’s within a quoted attribute value. Thus we must correctly match attributes.
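To see the flaw concretely, here is a minimal sketch in Python. Python's re module is my choice for illustration only (the post itself targets .NET), and the variable names are made up for the example.

```python
import re

# The naive pattern from the post: tag name, whitespace, then anything that
# is not ">", then the closing ">".
NAIVE_TAG = re.compile(r"</?\w+\s+[^>]*>")

html = '<img title="displays >" src="big.gif">'
match = NAIVE_TAG.search(html)

# The match stops at the ">" inside the quoted title attribute,
# so the src attribute is lost.
print(match.group(0))
```

The printed result is the truncated tag shown above, not the whole img element.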

Now there are four possible formats for an HTML attribute:

name="double quoted value"
name='single quoted value'
name=notquotedvaluewithnowhitespace
name

Each of these cases is quite simple. In the first case, you could do the following:

\w+\s*=\s*"[^"]*"

The portion "[^"]*" matches a double quote, followed by any non double quote characters, followed by a double quote. Another way to express this is to use lazy evaluation as such:

\w+\s*=\s*".*?"

The portion ".*?" uses lazy evaluation (the “lazy star”) to match as few characters as possible. For example, if we had a string like so

<a name="test" value="test2">

evaluating ".*" (aka greedy) would match

"test" value="test2"

However using the lazy evaluation consumes the fewest characters that match the expression, thus the first match using ".*?" would be "test" and the second match is "test2".
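The difference between the greedy and lazy forms can be checked with a short Python sketch. The language is my choice for illustration, and the sample tag below assumes both attribute values are double quoted.

```python
import re

# Sample tag with two double-quoted attribute values (assumed for this example).
tag = '<a name="test" value="test2">'

# Greedy: ".*" runs to the last double quote in the string.
greedy = re.findall(r'".*"', tag)

# Lazy: ".*?" stops at the first double quote that closes the value.
lazy = re.findall(r'".*?"', tag)

print(greedy)  # ['"test" value="test2"']
print(lazy)    # ['"test"', '"test2"']
```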

The full expression for matching an HTML tag is that lovely mash of characters presented at the very beginning of this post. It’s a modified version of the one presented in Friedl’s book.
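As a rough illustration of the full expression at work, here is a Python sketch (not the .NET code this post is building toward). The sample text and the names HTML_TAG, text and tags are hypothetical.

```python
import re

# The post's full tag expression, written as a raw triple-quoted string so
# the embedded quote characters need no escaping.
HTML_TAG = re.compile(
    r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
)

# Hypothetical sample text, made up for this example.
text = 'Click <a href="x.html">here</a> to see <img title="displays >" src="big.gif">.'

# finditer yields match objects; group(0) is the whole matched tag.
tags = [m.group(0) for m in HTML_TAG.finditer(text)]
for tag in tags:
    print(tag)
```

This time the img tag comes back whole, because the quoted title value is consumed as a unit before the closing bracket is looked for.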

However I wouldn’t recommend you just plunk that down in your code. Rather, you should consider adding it to a regular expression library assembly.

Don’t know how? Well I’ll show you a code listing for an exe that when run, builds a fully compiled version of this regular expression into an assembly that you can then reference in any project. In a later installment, I’ll explain in more detail just what the code is doing and how to use the compiled assembly. How irresponsible of me not to do that now. ;)

Source Listing


Comments

129 responses

Dimitri Glazkov • October 25th, 2004
Look at you go, man. That's good stuff. After all of that partisan politics junk, it's refreshing to see a good tech post :)
Haacked • October 25th, 2004
Ha ha ha... Thanks. At this point I think the two undecided people in this world will have to make up their own minds and not have me tell them what to think (though I think I would do a fine job of that).

You should hopefully see some more good techie posts coming up.
Pat • October 25th, 2004
Just to play devil's advocate for a minute, it seems like HTML is just too darned close to XML to have to parse this way. Isn't there a library out there for converting HTML into XHTML? If you can do that, you can just read the file in using XmlDocument::LoadXml(). Once you've done that, you can find your tags using an XPath query. Sorry, I just couldn't let a parsing post go by without tossing in my two cents ;)
Haacked • October 25th, 2004
Pat, you're absolutely right. There is an SGML library as well as the HTML agility pack (http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx).

However, we (RSS Bandit) found that SGML was too heavyweight, poor performing, and even a bit buggy for the simple task of searching HTML for links to RSS Feeds.

I was tasked with replacing SGML with regular expressions and it performs quite well.
Jon Galloway • October 26th, 2004
Great info. I've never really understood the lazy / greedy match thing before. Thanks.

Using RegEx's on nested HTML gets more difficult. We messed with balancing groups, which are supposed to handle nested constructs, but we never got it working. Every sample (MSDN, Dan Appleman's book, etc.) was the same nested parentheses thing, but didn't work for HTML.

Did you handle nesting in your HTML parsing?
Haacked • October 26th, 2004
Funny you mention this. I was just talking about this with a friend and on a newsgroup.

Regular expressions just aren't well suited for nested matching. Balanced groupings are a Microsoft innovation to regular expressions, so they're not something I've played around with much.

Since the info I needed was inside a tag, my regular expression works fine for that type of processing. You could also use it to strip all tags from a document.

If I were using it to actually parse an HTML doc (which I have some code that does), I'd keep track of indices: every time I match a tag, I record the beginning and end index. Then I compare that with the previous matched tag's indices and grab the content between.
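The index-tracking idea can be sketched roughly as follows. This is a Python illustration under my own assumptions, not the actual code referred to above; the function name text_between_tags is hypothetical.

```python
import re

# The tag expression from the post.
TAG = re.compile(
    r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
)

def text_between_tags(html):
    """Collect the text found between consecutive tag matches."""
    pieces = []
    last_end = 0  # end index of the previously matched tag
    for m in TAG.finditer(html):
        # Anything between the previous tag's end and this tag's start is content.
        if m.start() > last_end:
            pieces.append(html[last_end:m.start()])
        last_end = m.end()
    # Trailing text after the final tag, if any.
    if last_end < len(html):
        pieces.append(html[last_end:])
    return pieces

print(text_between_tags('<p>Hello <b>world</b>!</p>'))
```

Joining the returned pieces gives the document with all tags stripped.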
Simon Mourier • October 27th, 2004
What about comments (<!-- blablah -- >)? Your regex matches links in comments too, right? Same remark with <script> and <style>?

Simon (Just trying to be annoying . You know me :-)
Haacked • October 27th, 2004
Oh! You're eeeevil! You are absolutely right. It would match tags within comments.

It's easy enough to strip out comments before parsing.
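One possible way to do that stripping step, sketched in Python. The comment pattern used here is the one quoted later in this thread from RegExLib; the function name strip_comments is hypothetical, and this is an illustration rather than the code actually used.

```python
import re

# Comment pattern quoted later in this thread: "<!--", a lazy run of any
# characters (including newlines), then "--", optional whitespace, and ">".
COMMENT = re.compile(r"<!--[\s\S]*?--[ \t\n\r]*>")

def strip_comments(html):
    """Remove HTML comments so tags inside them are never matched."""
    return COMMENT.sub("", html)

sample = '<p>a</p><!-- <a href="x">hidden</a> --><p>b</p>'
print(strip_comments(sample))
```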
Haacked • November 4th, 2004
I found a mistake. For some reason my blogging engine capitalized some characters. Also, if a tag is on multiple lines, the expression above is broken. Here's my updated one.

</?\w+((\s+\w+(\s*=\s*(?:"(.|\n)*?"|'(.|\n)*?'|[^'">\s]+))?)+\s*|\s*)/?>
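A small Python sketch of the difference between the two expressions on a tag that spans several lines. Python is used for illustration only, and the sample tag is made up.

```python
import re

# The expression from the post, where "." does not match a newline.
ORIGINAL = re.compile(
    r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
)

# The updated expression, where (.|\n) lets quoted values span lines.
UPDATED = re.compile(
    r"""</?\w+((\s+\w+(\s*=\s*(?:"(.|\n)*?"|'(.|\n)*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
)

# Hypothetical tag whose title value contains a line break.
html = '<img\n  title="line one\nline two"\n  src="big.gif">'

print(ORIGINAL.search(html))           # None: the quoted value cannot be matched
print(UPDATED.search(html).group(0))   # the whole tag
```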
tester • March 2nd, 2005
<iframe src="www.mic.com"></iframe>some text
Ash • March 4th, 2005
How can I use regex to match a HTML comment

eg: 

thanks
Jim MacDiarmid • April 4th, 2005
Hi,

I'm working on a template parsing engine and I was wondering how I would go about using Regex to capture text(html) between custom tags?

Jim

jim.macdiarmid@comcast.net
Haacked • April 4th, 2005
Hi Jim,

That's actually quite challenging. You could attempt to use .NET's balanced matching mechanism, but it's pretty difficult to get the grasp of and it's not standard Regex.

It really depends what you're trying to accomplish. One way I've done it is to use a regex that matches html tags and then strip all tags out.

Hope that helps.
Adrian • April 13th, 2005
Can anyone help me out? I'm trying to build a regular expression to validate an SQL full text query. So something like:

"cable television" and "wireless bluetooth" OR "3G"

to ensure all booleans operators are outside quotes, and all text is in quotes.

Is this possible with a regular expression? or the wrong approach?

Cheers!

adrian
Haacked • April 13th, 2005
I'm not too familiar with SQL full text query syntax. How do you escape a double quote within the text?

The naive approach is something like:

"[^"]*?"(\s*(and|or)\s*"[^"]*?")*

This is basically matching an expression like:

"some text"

or

"some text" and "some more text"

or

"some text" and "some more" or "even more"

and so on.

This expression doesn't handle escaped double quotes within the text.
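Here is a rough Python sketch of how that expression could be used for validation. Two things are my assumptions rather than part of the expression: matching is case-insensitive (the sample query mixes "and" and "OR"), and the whole input must match, not just a part of it. The function name is_valid_query is hypothetical.

```python
import re

# The naive expression above, compiled case-insensitively (an assumption).
QUERY = re.compile(r'"[^"]*?"(\s*(and|or)\s*"[^"]*?")*', re.IGNORECASE)

def is_valid_query(text):
    """True when the entire input is quoted terms joined by and/or."""
    # fullmatch requires the pattern to cover the whole string.
    return QUERY.fullmatch(text.strip()) is not None

print(is_valid_query('"cable television" and "wireless bluetooth" OR "3G"'))
print(is_valid_query('"cable television" and wireless'))
```

As noted, escaped double quotes inside a term are still not handled.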
Don • April 21st, 2005
How can I get rid of a comma and a parenthesis inside a tag.

eg.

<p(,)> this is the function calculate() </p(,)>

after replace it should look like:

<p> this is the function calculate() </p>
Haacked • April 22nd, 2005
Are commas allowed within the attributes of a tag? For example:

<p title="This has a comma, right here.">

Should the regex strip that comma as well? If so it's much easier than if not.

Phil
Don • April 22nd, 2005
yes if there is a comma inside < and > tag or < and /> it should be removed.

thanks

right now i'm using [(),] which get rid of them even in the value tag.

<p(,)> this is the function, calculate() </p(,)>

after replace it should look like:

<p> this is the function, calculate() </p>
Haacked • April 22nd, 2005
Just to be clear, in my example above, you'd want the comma removed from the title attribute as well, right?

If so, I'd just use the html expression to match tags

</?\W+((\S+\W+(\S*=\S*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>

Once you have the tag, replace , with empty space.

Much easier to take a two-step approach on this one.
Haacked • April 22nd, 2005
Ah, I see the problem. Commas aren't allowed in valid HTML, so my expression won't match it. However I have a code snippet that will. I haven't tested it super thoroughly, but it worked on some samples I threw at it. It won't remove commas within an attribute value.

// Needs: using System.Text; using System.Text.RegularExpressions;
public string RemoveCommas(string html)
{
    // Same tag expression as before, but commas are accepted wherever whitespace is.
    Regex regex = new Regex(@"</?\w+(((\s|\n|,)+\w+((\s|\n|,)*=(\s|\n|,)*(?:"".*?""|'.*?'|[^'"">\s]+))?)+(\s|\n|,)*|(\s|\n|,)*)/?>", RegexOptions.Singleline);
    int lastIndex = 0;
    StringBuilder result = new StringBuilder();
    MatchCollection matches = regex.Matches(html);
    foreach(Match match in matches)
    {
        // Copy the text before this tag unchanged.
        result.Append(html.Substring(lastIndex, match.Index - lastIndex));
        // Copy the tag itself with its commas removed.
        result.Append(match.Value.Replace(",", ""));
        lastIndex = match.Index + match.Value.Length;
    }
    // Copy whatever follows the last tag.
    result.Append(html.Substring(lastIndex));
    return result.ToString();
}
Don • April 22nd, 2005
I like the function method. Although it didn't replace any commas.

But let me tell you what I really want. I'm reading an xml file line by line. And some xml tags have commas.

So I want to delete those commas. That's only in name tag. If a comma in value eg. <something>bla bla, bla</something>

it should leave intact.

Here's the code I have right now:

string theThingToReplaceWith = "";
Regex exp = new Regex(@"[,]");
try
{
    while(myStreamReader.Peek() != -1)
    {
        theLine = myStreamReader.ReadLine();
        theLine = exp.Replace(theLine,theThingToReplaceWith);
        myArrayList.Add(theLine);
    }
}
catch(EndOfStreamException eose)
{
    Response.Write(eose.ToString());
}
Help Me • May 20th, 2005
how to replace alphanumeric characters between a tag with a "-"

Eg:

<Tag>Aapple's Great % Fruit</Tag>

<Tag>Aapple's Bad # Fruit</Tag>

should become

<Tag>Aapple-s Great - Fruit</Tag>

<Tag>Aapple-s Bad - Fruit</Tag>
Mark Fletcher • May 26th, 2005
Hi,

Good article! I was wondering if anyone can recommend a strategy for parsing HTML for a web spider? I want it to be able to match links, and find links in javascript. So far Ive been using one expression to extract the links. However with javascript code embedded in a file, I think Id be better off running two passes -

1) Grab the links in tags

2) Make a pass for anything in javascript tags

What do you think?
Robin • June 4th, 2005
tried this using php and preg_replace, comes up with error:

Warning: Unknown modifier '\' in /var/www/html.php on line 20

line 19: $reg = '</?\W+((\S+\W+(\S*=\S*(?:".*?"|'."'".'.*?'."'".'|[^'."'".'">\s]+))?)+\s*|\s*)/?>';

line 20: preg_match_all($reg, $html, $matches);
haacked • June 5th, 2005
Hi Robin, this syntax is particular to .NET, so I'm not sure if you have to make some modifications for PHP.

Also, the expression you're trying is not correct. The corrected expression is at the following URL (https://haacked.com/archive/0001/01/01/2784.aspx).
Reddy • June 10th, 2005
Please let me know the regular expression to detect one or more only white spaces. My requirement is I should raise error message if some body enters only white spaces in the Order name text box.
Pasan • June 11th, 2005
This article is good.

How can we use a regular expression to read text within specific html tags?

Also if we want to read text within the first paragraph tags of a web page how can we use a regular expression to do that?
Vadym • May 11th, 2006
Brilliant! But how about case like this:
<td> Some weird <text> goes here </td> ? Expression above will treat <text> like tag and replace with "". To prevent this from happening I loop through the string and get anything that is not a tag but looks like tag (I have to match <text> to collection of valid tags). Save all occurrences in temp variables (something like #0001=’<text>’, #002=’<text 2>’ etc…). After I run regexp above to strip all HTML tags. And then loop again and replace all temp variables with corresponding values. Pain a bit ;-) So my output looks like:
Some weird <text> goes here. Any better ideas?
Haacked • May 11th, 2006
The problem is since <text> is not escaped, it probably won't get rendered by the browser, so stripping it shouldn't be a problem.
Ideally, you wouldn't allow that sort of thing. You'd HTML encode it so it looks like &lt;text&gt;
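For illustration, HTML encoding is a one-liner in many standard libraries. The sketch below uses Python's html.escape; that is my choice of example and not the encoder used on this blog.

```python
import html

raw = "Some weird <text> goes here"

# html.escape replaces <, > and & (and quotes by default) with entities,
# so the text no longer looks like a tag to a tag-matching expression.
encoded = html.escape(raw)
print(encoded)  # Some weird &lt;text&gt; goes here
```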
Kjeks • May 31st, 2006
Since the question came up: HTML tags are effectively matched with the regexp

.*? provides non-greedy matching. Use single line option for good karma :-)
rubendj • June 5th, 2006
 doesn't work in cases like:
 visible text 
because it matches the entire line.
This reg. exp. works better: )]*-->
Kjeks • June 14th, 2006
Actually,  works very well for the case you're describing.
The '?' in '.*?' makes the match non-greedy, meaning that as few characters as possible will be matched. This works in Perl and Python, at least. Other languages may have another syntax.
Freddy • July 6th, 2006

Here is a regex that will accept form input, as long as it doesn't contain any HTML or JSP comment tags:
^(?:(?!(!|%)--[\s\S]*?--[ %\t\n\r]*>).)*$
It works on the client-side (JavaScript RegEx engine) but fails on the server side with an "Invalid RegEx" error.
Does anyone know how to write the expression so that it will validate against the XML specification for RegEx?
Thanks,
---Freddy
More details follow:
I need a regex that will reject form data that contains HTML or JSP comments.
I am constrained to entering the pattern into a proprietary xml editor.
I don't have the option of using substitution or NOT operators - only the expression itself.
Page 198 in the Perl Cookbook provides the basic format for a NOT regex:
^(?:(?!PATTERN_NOT_TO_MATCH).)*$
And the RegExLib provided the HTML comment pattern that I want to dis-allow:
<!--[\s\S]*?--[ \t\n\r]*>
My Sax parser wont allow a less-than character in the expression, so I will leave it out.
I also want to dis-allow JSP comments, so I added a % as an alternate to the ! at the beginning and as a member of the character set at the end.
^(?:(?!(!|%)--[\s\S]*?--[ %\t\n\r]*>).)*$
It works fine on the client-side, but fails on the server-side.
How do I write an equivalent regex that will validate against the XML spec?
Shawn • August 8th, 2006
^(?:(?!(!|%)[\s\S]*?--[ %\t\n\r]*>).)*$
*should* also find the CDATA used in XML to escape HTML characters. This is untested, but it builds off
<![\s\S]*?--[ \t\n\r]*>
which is what I use (fully tested on thousands of pages, but I'm not looking for JSP comments)
John • September 10th, 2006
You are a RegEx genius!
Is it possible to enhance the expression to get a reference to the inner html if present, something like (?<innerHtml>)
For example, for the value:
<title>My Title</title>
I could retrieve the value "MyTitle"
Thanks!
JMARIN • October 24th, 2006
Hello!!
does it works in ASP.NET HTML Tag like:
<asp:Textbox>
santhosh • November 19th, 2006
I want to write a regular expression for commenting out all the script tags in a page which is not present inside the comment block.
For ex:
1)
some text here..... <script type="text/javascript"> alert("Hi santhosh"); alert("Bye...."); </script> some text here......
I should comment the script section above. The output should be
some text here.....  some text here......
2) But i should not comment for the following scenario. The portion of code is already inside the comment.
some text here.....  some text here......
I have written the regex for commenting all the script tags.
The following is the regex i used
<script(.*?)>((.|\n)*?)(<\/script>)
Please help me.....
Haacked • November 19th, 2006
Two options I can think of.
1. Before processing, remove commented out scripts. For example: Replace:
" with "</script>"
And then run your original replacement.
OR
Use the negative lookahead and negative lookbehind to make sure the script you're replacing doesn't already have comments.
Milenko Curcin • December 7th, 2006
I'm trying to make a good regexp for finding html tags so that i could colour them (program for colouring code) and there is one problem with this regexp, it will mach <br> inside <pre> like for example here
<pre>some text<br>some more text</pre>
and this shouldn't happen, <br> is not a tag anymore, it is a text. my knowledge of regexp is not big and i don't know how to fix this :(
Ashly • December 19th, 2006
Hi,
I am working with PHP 5.0
I have a string like this:

[code]
function getSize($userIDArr)
{
$arrSize = sizeof($userIDArr);
}
[code]
function getUserNames($userIDArr)
{
for($i=0; $i < sizeof($userIDArr); ++$i)
{
$userNamesArr[$i] = $userIDArr[$i];
}
return $userNamesArr;
}
[/code]
[/code]

I need to replace the outer tag [code] [/code] pair with <abc> </abc>

The result should look like:

<abc>
function getSize($userIDArr)
{
$arrSize = sizeof($userIDArr);
}
[code]
function getUserNames($userIDArr)
{
for($i=0; $i < sizeof($userIDArr); ++$i)
{
$userNamesArr[$i] = $userIDArr[$i];
}
return $userNamesArr;
}
[/code]
</abc>

If anyone have any idea, please help me..
Thanks in advance
Ashly
Ashly • December 19th, 2006
Hi
My email id is:
meetashly@yahoo.com
Thanks
Ashly
SteveLionbird • December 28th, 2006
Kjeks..

.*? provides non-greedy matching. Use single line option for good karma :-)

Excellent tip .. I'm a CF developer and could not find that documented. CF does not support lookbehinds so I was pulling my hair out trying to grab content between matching opening and closing tags accurately.
Thanks,
Steve
Nokturnal • January 16th, 2007
Can anyone port this over to work within javascript?
<\/?(?!strong|b|i|em)\w+((\s+\w+(\s*=\s*(?:"(.|\n)*?"|'(.|\n)*?'|[^'">\s]+))?)+\s*|\s*)\/?>
Cheers and thanks for the great tutorial here!
Nokturnal • January 16th, 2007
Oh man, forget what I typed above. The error was in the actual implementation of the javascript itself.
Cheers and sorry to waste your time :)
lb • April 11th, 2007
>You’ll have to use the SingleLine RegexOption for it to work
i had this same issue with a regex yesterday... how counter-intuitive is it that you use the singleline option to match a pattern that spans multiple lines?
i guess their thinking is that the singleline option means "treat this input as a single line" -- but that's not how we think when trying to get a regex to work. We think more along the lines of 'i want my pattern to match, even when it spans multiple-lines'
you know i wish i'd found this blog post yesterday... i had many of the same issues: the need for non-greedy matches, the need for the single line option... but my regex-fu is weak. i got there in the end.
lb
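The same behaviour exists outside .NET under a different name. As a hedged illustration, Python calls the equivalent flag re.DOTALL; the sample value below is made up.

```python
import re

# A quoted attribute value that contains a line break (hypothetical sample).
value = 'title="line one\nline two"'

# By default "." stops at a newline, so the lazy match never reaches
# the closing quote.
without_flag = re.search(r'".*?"', value)

# With DOTALL (the counterpart of .NET's Singleline), "." also matches "\n".
with_flag = re.search(r'".*?"', value, re.DOTALL)

print(without_flag)          # None
print(with_flag.group(0))
```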
Robert • April 19th, 2007
I am versed in regexs so here is an idea I am testing for php. It matches doctype, open, close, and comment tags. It isn't perfect but it gets closer to html rules. This fixes the bug where a tag could be </div /> and is multi-line compatible.
preg version:
(
<\!\w+(?:\s+[^>]*?)+\s*>|
<\w+(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^"'>\s]+))?)*\s*/?>|
</\w+\s*>|
<\!--[^-]*-->
)

php version:
$pattern = "/(".
"<\!\w+(?:\s+[^>]*?)+\s*>|".
"<\w+(?:\s+\w+(?:\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\"'>\s]+))?)*\s*\/?>|".
"<\/\w+\s*>|".
"<\!--[^-]*-->".
")/";
@preg_match_all($pattern, $filedata, $matchesArray);
var_dump($matchesArray);
Chris • April 22nd, 2007
Hi Folks,
I'm trying to get plain text out of "any" html document, which seems to be quite difficult to do cause most of the regex I tried doesn't match scripts, styles and so on. Any suggestions to do it in a different way, I'm programming in C# .NET
Thanks in advance and sorry for my poor english
Conrad de Wet • April 24th, 2007
Hi,
Been searching around to find Javascript that will convert <TAG attrib=xyz> into <TAG attrib="xyz">
I would imagine this code you have developed is close im just not sure how to use it.
Reason: Using a CMS for editing and creating HTML with the output as XML to flash. The textField.htmlText does not support unquoted attributes.
Any assistance appreciated.
Thanks
Haacked • April 24th, 2007
The easiest way to do this is to download the Subtext Source Code and do a search for the method ConvertHtmlToXHtml.
You'll need to add a reference to SgmlReaderDll.dll which is included in the source.
anon • May 4th, 2007
Great regex, but just to note it doesn't work with attributes that do not have quotes:
[div onclick=alert("hacked") ]
I've been messing with the regex as it is and there doesn't seem to be an easy way to support all three styles.
any ideas?
Carlos • May 28th, 2007
How can I retrieve only the text that is between the tags <title> and </title>?
thanks
NeoGeo • June 4th, 2007
Can someone help me? I would like to search for script tags. I would like to have it check either the opening tag or closing tag or both. I just need to search if the string have a script tag.
Thank you in advance.
Shail • June 14th, 2007
I am not strong in regular express. I am stuck while validation one string. Actually I want test or test or test or any string. Can you help me to generate regex for this validation. Thaxs
Brad • July 9th, 2007
I don't see any use in this regex.. I'm trying to do something more advanced I havn't seen done yet.
I'm matching HTML tags too.. but the whole tag not just the start tag. I want the start and the end tag. I havn't been able to figure out how to do the recursive part, only a fixed amount of levels, like 10.
Meaning I want to match the whole outer tag of..
<div id="a"><div><div></div></div></div>
So I search for the tag of id="a" and it returns the whole contents even the subed tags. I'm pretty much done just can't figure out the infinite nested duplicate tags. GLGL.
Sagar • July 11th, 2007
I am trying to super-optimize an html output (getting rid of 2 or more spaces in the response ) but failed to construct a suitable regex, can any body help?
Luke-Jr • September 6th, 2007
I realize I'm a bit late seeing this post, but....
<img title="displays >" src="big.gif">
happens to be INVALID HTML in the first place. > is not allowed even in attributes!
The correct code, which does NOT break your original regex is:
<img title="displays &gt;" src="big.gif">
cioman • October 10th, 2007
</?[a-z][a-z0-9]*[^<>]*>
can strip an html file of all tags.
Can we prevent it from searching the following tags:

<sup>
The application of such a regex would be to preserve all the tags that make a structure of the text, with minimal formatting.
What would be better if one could replace 'i' with 'em' and 'b' with 'strong' (<> removed and '' placed instead, as the post was being messed up when I posted).
Any ideas??
Thanks, in advance.
cioman • October 10th, 2007
Sorry! Re-posting, because the previous post didn't show up well:
---

</?[a-z][a-z0-9]*[^<>]*>
can strip an html file of all tags.
Can we prevent it from searching the following tags:
a, i, b, p, sup and their closing tag equivalents (i have removed <> as the tags weren't displaying when I posted the query earlier)
The application of such a regex would be to preserve all the tags that make a structure of the text, with minimal formatting.
What would be better if one could replace 'i' with 'em' and 'b' with 'strong' (replace '' with <> please, I was having problems posting this query to the forum).
Any ideas??
Thanks, in advance.
Jesse Morrow • October 15th, 2007
I allow the user to submit customized header and footer HTML through a web form. While I'm not too concerned with what they do with the HTML or even if it contains valid tags it is important that their tags be well formed and balanced (i.e. properly matched with ending tags) so that their potentially bad HTML doesn't cause the rest of the web site to get trampled.
I made a Javascript function which takes an HTML snippet as a string and returns true if the HTML is well formed and all tags are properly balanced.
I took the regular expression given here and extended it to match:
1) any opening tag, its *text* content, and its corresponding closing tag,
2) any self-closing tag - such as <br />, <input />,
3) any HTML comments, or
4) pure text (i.e. no HTML tags)
The trick is taking into account the nested nature of HTML which regular expressions aren't expressive enough to match. My trick is to iteratively replace each matched portion with nothing - thus stripping it from the HTML string until it has been stripped down to an empty string. If the loop can strip the HTML iteratively to an empty string then it must be valid and all tags balanced. If the loop hits a point where nothing new is being matched and stripped and yet the string is still not empty then the HTML is invalid or unbalanced.
The reason this works is because each loop iteration strips off all the most deeply nested elements which have no child elements (leaf elements) thus leaving their parent elements as leaf elements for the next iteration.
Here is the code:
var regex = /[^<>]*<(\w+)(?:(?:\s+\w+(?:\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>[^<>]*<\/\1+\s*>[^<>]*|[^<>]*<\w+(?:(?:\s+\w+(?:\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/>[^<>]*||^[^<>]+$/ig;
function validate(html) {
var v = html;
do {
html = v;
v = html.replace(regex, '');
} while( v != html)
return v.length==0;
}
The loop structure and iterative concept is totally stable. The only thing which might need some refinement is the regular expression as I haven't thought too deeply about all the comment and line return possibilities.
Jesse
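For readers who want to try the iterative idea outside JavaScript, here is a simplified Python sketch. It is not a port of the expression above: the LEAF pattern is my own reduced version (it does not handle ">" inside quoted attribute values), and the name is_balanced is hypothetical. Unlike the original function, it accepts leftover plain text instead of requiring an empty string.

```python
import re

# One "leaf" unit: a comment, a self-closing tag, or an element whose
# content holds no further tags. Simplified, assumed pattern.
LEAF = re.compile(
    r"<!--[\s\S]*?-->"                   # an HTML comment
    r"|<\w+\b[^<>]*/>"                   # a self-closing tag such as <br />
    r"|<(\w+)\b[^<>]*>[^<>]*</\1\s*>",   # <tag ...>text</tag> with matching names
    re.IGNORECASE,
)

def is_balanced(html):
    """Strip leaf units repeatedly; balanced input leaves no brackets behind."""
    current = html
    while True:
        stripped = LEAF.sub("", current)
        if stripped == current:  # nothing more can be removed
            break
        current = stripped
    return "<" not in current and ">" not in current

print(is_balanced('<div id="a"><div><p>text</p><br /></div></div>'))  # True
print(is_balanced('<div><p>text</div>'))                              # False
```

Each pass removes the innermost elements, so their parents become leaves on the next pass, which is the same reasoning as described above.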
Jason • October 26th, 2007
This is in regards to this comment.
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" id="something">
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
Those tags don't get matched because of the non \w characters ":" and "-".
It should be changed to...
$pattern = "/(".
"<\!\w+(?:\s+[^>]*?)+\s*>|".
"<\w+(?:\s+\w+([^>]*)(?:\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\"'>\s]+))?)*\s*\/?>|".
"<\/\w+\s*>|".
"<\!--[^-]*-->".
")/i";
Alex • November 6th, 2007
First off, thanks for the original regular expression, </?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>. It took me seconds, maybe a minute to find what I needed doing a google search. It took me much longer to get this working in my application using JavaScript. Since it was a PITA for me, I am going to show how I did it, because I would have been thankful to find this.
function ContainsHtml(inputText)
{
var htmlRegex = new RegExp(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gim);

if(inputText.match(htmlRegex))
{
return true;
}
return false;
}
The difficult part was setting this string </?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?> in JavaScript with quotation marks and forward and backslashes. I started off by surrounding the initial regex string with quotation marks, then escaping the appropriate characters, but this did not work.
var re = "</?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)/?>";
After some time, I gave up and went with this approach, which worked.
new RegExp(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gim);
Basically, I was escaping the wrong characters. Hope this helps, and thanks again for the regex Phil.
sums • November 15th, 2007
Hi,
I was reading through your posts.
Can you please help me with this problem:(Which is pretty simple but I do not know regular expressions)
I need to define a rule where anypage.aspx converts to anypage (to define in web.config)
ex: www.xyz.com/page1.aspx or www.xyz.com/page2.aspx etc. becomes
www.xyz.com/page1 and www.xyz.com/page2 respectively.

Looking forward to hearing from you soon.
Thanks..
Jinesh Shah • February 21st, 2008
Hi ALl...i got a nice script here...Bt stilli hv some problem..i m explaining here...
i hv string like :
string htmlstring ="<form name="f1"> <input type="text" name="t1"><input type="text" name="t2"><input type="submit" name="submit"></form>"
now Using Regular Expression i want value of name attribute not others...so can anybody give me dat regular expression..
ans must be : t1 t2
not f1 or submit ...
i hv tried lots of expression bt still cant find the final one....
Thank u
Jinesh
mj • April 24th, 2008
hi i am doing an assignment and need help!!!!
i am parsing html using c#.net and need help finding img, object, and applet tags using regular expressions. please help!!!
Mic • May 1st, 2008
Hi
I am using preg_split to split words...and then trying to match using
preg_grep, how do i match html tags?
Thanks
tim • May 11th, 2008
I'm lazy and had been using
<(/)?(b|i|u|sup|sub|small|big)([^>]*)>
to match tags. It works pretty well for me since I don't care what's in the tag itself like attributes, but fails on self closing tags. This is a problem for another day.
I thought that I could match everything except the list by adding a "not" operator (^) inside those brackets:
<(/)?(^b|i|u|sup|sub|small|big)([^>]*)>
but it doesn't work. Somebody has made off with my regular expression reference book and the "tutorials" I'm reading online are leaving me more confused than if I just brute force it myself.
Any assistance would be appreciated
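A caret only means "not" inside a character class such as [^>]; inside a group it is a start-of-input anchor. One common way to express "any tag except these" is a negative lookahead, the same construct used in an earlier comment in this thread. The Python sketch below is an illustration under my own assumptions; the pattern and sample text are made up.

```python
import re

# Match any tag whose name is NOT one of the listed ones.
# (?!...) is a negative lookahead; the \b keeps "b" from also protecting
# longer names that merely start with b, such as "br".
OTHER_TAGS = re.compile(r"</?(?!(?:b|i|u|sup|sub|small|big)\b)\w+[^>]*>")

# Hypothetical sample markup.
sample = '<p>Some <b>bold</b> and <span class="x">plain</span> text</p>'

# Strip every tag except the protected list.
print(OTHER_TAGS.sub("", sample))
```

Like the expression it is based on, this ignores the case of a ">" inside a quoted attribute value.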
Carros DF • May 20th, 2008
<!--[\s\S]*?--[ \t\n\r]*> work nice on DreamWeaver CS to remove comments via SEARCH AND REPLACE box.
Alexander Thorell • June 16th, 2008
Hi and thanks Haack, but your expression \s]+))?)+\s*|\s*)/?> does not seem to match tags with attribute names containing a hyphen (-), like <meta http-equiv="content-type" content="text/html; charset=UTF-8">
Alexander Thorell • June 16th, 2008
Sorry, the expression got truncated, but I'm referring to the updated expression in your earlier comment. See if it works now...
\s]+))?)+\s*|\s*)/?>
Frank Dase • June 26th, 2008
I need an expression for a searchengine to highlight the word I searched for. But I have to ignore matches inside HTML tags.
for example:
<base href="<a rel=" nofollow="" external"="" href="http://www.test.com/laser" title="http://www.test.com/laser">http://www.test.com/laser">this is a laser test...
I want only to match "laser" outside the base tag.
I'm a noop in reg expression, so I hope you can help me. I need it for classic ASP.
wow.. your expression is good • July 1st, 2008
Thank you~^^
moo • July 1st, 2008
While RegExs are good for a variety of purposes, people really need to think about using the HTML DOM for most of their processing needs. If you're just trying to get elements from inside tags or information, using the DOM tends to be easier and more extensible.
Skyetech • July 21st, 2008
Frank D,
I'm trying to do exactly what you're doing with the search engine results. Did you get a solution to your problem?
Jamey Taylor • July 22nd, 2008
Found a case it doesn't handle:
It doesn't match if the attribute name has a non-word character. For example, the hyphen in the first attribute below causes it to not match:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
This has been a huge help to me in maintaining hundreds of non-xhtml compliant pages. Thanks!!!
yuree • July 22nd, 2008
I want to strip non digits and first digit, any one can help?

I use "/[^\d]/g" to remove non digits, but I want to remove first digit too, e.g.

23,185

by using "/[^\d]/g", i see 823185, but i want to trim first digit.

I just want to display "23185", any clue?
Craig Laparo • August 3rd, 2008
I'm a student in the Google Summer of Code program, working for Dojo (http://dojotoolkit.org) and I came across your awesome regular expression for catching HTML tags. I'd like to use it in my code, if it's ok with you. Do you have a CLA?
tp • September 21st, 2008
Hi Friends,

EDITOR: Comment Removed because of HTML formatting issues. Sorry.
Volkan Vardar • October 8th, 2008
<.*?>
should do the same...
Pencheff • October 9th, 2008
For those also working with Delphi and TRegExpr library, here's my Regex for parsing HTML tags (it's based on Phil's one):
RegexParser.Expression := '(?i)<(/?\w+)((\s+(\w+)(\s*=\s*("(.*?)"|[^''">\s]+))?)+\s*|\s*)/?>';
It does return the tag (b, /b, font, /font, etc) in Match[1]
the parameter (size, color, etc) in Match[4] and
the parameter value in Match[7]
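The group numbering above carries over to other engines, with one caveat worth seeing in action: a group inside a repeated part keeps only its last capture. Here is a rough Python sketch of the same expression (the Delphi doubled quote `''` written as a single `'`); the second pattern `ATTR` and the sample tag are assumptions added for illustration.

```python
import re

# Same expression as the Delphi one above, as a Python raw string.
TAG = re.compile(
    r"""<(/?\w+)((\s+(\w+)(\s*=\s*("(.*?)"|[^'">\s]+))?)+\s*|\s*)/?>""",
    re.IGNORECASE,
)

m = TAG.search('<font size="3" color="red">text</font>')
# Groups 4 and 7 sit inside a repeated group, so they hold the LAST attribute.
print(m.group(1), m.group(4), m.group(7))

# To see every attribute, run a second pattern over group 2 (the attribute run).
ATTR = re.compile(r"""\s+(\w+)(?:\s*=\s*(?:"(.*?)"|([^'">\s]+)))?""")
attrs = ATTR.findall(m.group(2))
print(attrs)
```

Each tuple in `attrs` is (name, double-quoted value, unquoted value), with an empty string for whichever form was not used.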
smith • January 10th, 2009
http://www.cyworld.com/colap/2360326
(<([\/@!?#]?[^\W_]+)(?:\s|(?:\s(?:[^'">\s]|'[^']*'|"[^"]*")*))*>)|(<\!--[^-]*-->)
is better good.
Joe Bob • January 10th, 2009
HTML is not regular and hence can't be reliably parsed by regular expressions. Please see the following site for more information:
http://htmlparsing.icenine.ca
robert • March 3rd, 2009
i need one regular expression that it show words using code html
Jamp Mark • March 5th, 2009
Here is a regex pattern to capture the URL in anchor tag HREF.
/<a\s+href="([^>]+?)"/
Mo • March 8th, 2009
Hi,
I need to match double quotes within a string, when they are not within html-tags. What is the pattern to match "up" but not "highlight"?
Hello <span class="highlight">Peter</span>, what's "up"?
I'm using a VB-RegEx-Engine...
Andrei • March 10th, 2009
I am trying to match and replace links in anchor tags. It works with /<a\s+href="([^>]+?)"/ but I need to replace only what is in the href (the URL), not the whole anchor element.
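One way to replace only the URL is to capture the text before and after it in their own groups and put those back unchanged. This is a minimal Python sketch under the same assumptions as the pattern above (href is the first attribute and is double-quoted); `rewrite_links` and the example.* hosts are invented for the example.

```python
import re

# Group 1: everything up to the opening quote, group 2: the URL, group 3: closing quote.
HREF = re.compile(r'(<a\s+href=")([^">]+)(")', re.IGNORECASE)


def rewrite_links(html: str, new_url_for) -> str:
    """Replace only the href value, leaving the rest of each anchor untouched."""
    return HREF.sub(
        lambda m: m.group(1) + new_url_for(m.group(2)) + m.group(3), html
    )


page = '<a href="http://old.example/a">A</a> and <a href="http://old.example/b" class="x">B</a>'
result = rewrite_links(page, lambda url: url.replace("old.example", "new.example"))
print(result)
```

Engines with lookbehind support could match the URL alone, but the three-group form works almost everywhere.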
Vaibhav • March 11th, 2009
i need a regular expression for removing self closing html tag
get entertainment • March 18th, 2009
How do you get the inner text using regex? What if there are several matching tags, e.g. <ul><li>content</li><li>content</li>etc. </ul>?
Black • March 27th, 2009
I want to get the content between these tags, but I don't know how to get it with a regular expression:
<helloworld>
thifkdslaj
fdskjaflksdj
fasjlk&8*()*$)#(
$*())(#
</helloworld>
So, how can I get the content between the helloworld tags with a regular expression? Between them there are a lot of line breaks,
and the text I have is too big and long, so I can't use a replace function.
thanks
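The line breaks are usually the sticking point here, because in most engines `.` does not match a newline unless a "dot matches all" or "single line" option is switched on. A small Python sketch of that idea follows; the sample text is shortened and the variable names are only illustrative.

```python
import re

# re.DOTALL lets "." match newlines; the lazy .*? stops at the first closing tag.
BLOCK = re.compile(r"<helloworld>(.*?)</helloworld>", re.DOTALL)

text = "before\n<helloworld>\nthifkdslaj\nfdskjaflksdj\n</helloworld>\nafter"

m = BLOCK.search(text)
inner = m.group(1)
print(repr(inner))

# Without the flag the same pattern finds nothing in multi-line input.
no_flag = re.search(r"<helloworld>(.*?)</helloworld>", text)
print(no_flag)
```

In engines without such a flag, `[\s\S]*?` in place of `.*?` gives the same effect.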
Steve Tattersall • March 21st, 2010
I want to remove lines of HTML code from the start tag <DOCTYPE ...> through the end tag </table>. How can I best remove multiple lines of HTML code?
Pete B • March 21st, 2010
Excellent.
Just spent the last hour trying to do this and failing:
/<\/?[a-z0-9-_]+(?:''|[^"'>]*(["']?)[\S\s]*?\1)\/?>/gi
Wize • March 27th, 2010
I am trying to write a regular expression that will replace text but not inside of double quotes; and if there aren't any quotes, then it replaces the text
Example: abc "abc" abc
Replace with: def
Expression: (?!")abc(?!")
Result wanted: def "abc" def
Result got: def "abc" def
However, if I have: def "z abc z" def
I get: def "z def z" def
Instead of: def "z abc z" def
using: (?!")abc(?!")
I tried: (?!"\w*)abc(?!\w*")
but got: def "z def z" def
I am hoping to write an expression that will change:
abc abc (to) def def
abc "abc" abc (to) def "abc" def
abc "z abc z" abc (to) def "z abc z" def
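The lookaheads in `(?!")abc(?!")` only inspect the character right next to the match, which is why `"z abc z"` slips through. A trick that tends to work better is to match whole quoted strings first and hand them back untouched, replacing only what matches outside them. The Python sketch below assumes quotes are balanced and never escaped; `replace_outside_quotes` is a made-up name.

```python
import re


def replace_outside_quotes(text: str, old: str, new: str) -> str:
    """Replace old with new, except inside double-quoted runs."""
    # Alternative 1 consumes a whole quoted string; alternative 2 is the target.
    pattern = re.compile(r'("[^"]*")|' + re.escape(old))

    def repl(m):
        # If group 1 matched we are inside quotes: return the text unchanged.
        return m.group(1) if m.group(1) is not None else new

    return pattern.sub(repl, text)


for line in ['abc abc', 'abc "abc" abc', 'abc "z abc z" abc']:
    print(replace_outside_quotes(line, "abc", "def"))
```

Because the quoted alternative comes first and consumes the whole quoted run, the target inside it is never seen on its own.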
Tass • April 8th, 2010
Hi
I have some HTML and I have to replace width:0;height:0;" with width:0px;height:0px;" in the style attribute.
Does anyone have an idea what the regex and the replacement could be?

<div class="cool"><img src="http://www.coolchaser.com/images/banner_xray.gif" alt="CoolChaser"></div><img style="visibility:hidden;width:0;height:0;" border=0 width=0 height=0 src="counters.gigya.com/.../bHQ9MTIxNjg3NDUyNjQyMSZwdD*xMjE2ODc*NjA*MDYyJnA9MjEwNjkxJmQ9Jm49bXlzcGFjZSZnPTE=.jpg" />
Anon • July 9th, 2010
stackoverflow.com/.../1732454#1732454
Flavio Troja • July 14th, 2010
I need a regular expression that catches the HTML code between the tags <div class="result">
and <br clear="all">
look like this:
<div class="result">
.... (I want to catch this)
<br clear="all">

can you help me?
celso • July 21st, 2010
How do I replaceAll myTerm outside of a <tag>, like
myTerm is <x='z myTerm w'> like myTerm
to
XXXXX is <x='z myTerm w'> like XXXXX
?
Martin Radev • September 5th, 2010
Here are some regular expression from me:
/< *img[^>]* src *= *["\']?([^"\']*)/is - img tag
/\< *meta[^>]*charset *= *["\']?([^"\']*)/i - encoding
/\<meta name="description" content *= *["\']?([^"\']*)/i - description
/<title> *(.*) *<\/title>/is - title
If you see something wrong you could comment it here :) 10x
Cthulhu • November 13th, 2010
For the love of all that is sane and good in the world, please delete this blog entry and redirect to an article about xpath. Sure, you can get away with a regular expression here and there for html & xml, but when you write about it in a blog there are people who inevitably try to "improve" upon that "little" hack and end up wandering down the path to madness by thinking that it is possible to parse html with regular expressions-- which, of course, is NOT possible.
www.codinghorror.com/...
Darko • November 15th, 2010
I'm trying to match a word inside a link word ... Can you give me a hint?
Thank you!
SuRGeoN • January 6th, 2011
just wrote a Regex in vb .net for following tags:
<tag_name varZ* varX=valueY*>
this regex will also include correctly tags as:
<tag_name varX="value >">
VB .NET Regex (HTML Tags)
Dim regex_tag As String = "(?<extract><[a-zA-Z]+\s*(.|\n)*?>)([^<>]*(?=<)|[^<>]*$)"
Hope you will find it useful
Cheers,
John
rr • February 28th, 2011
How can I match all the HTML input tags, but only those whose type is text?
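Filtering on an attribute value is a case where a parser, as suggested earlier in the thread, is often simpler than stretching the regex. Below is a minimal sketch using Python's standard `html.parser` module; the class name `TextInputFinder`, the helper `find_text_inputs` and the sample form are all assumptions for illustration.

```python
from html.parser import HTMLParser


class TextInputFinder(HTMLParser):
    """Collect the attributes of every <input> whose type is "text"."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        attr_map = dict(attrs)
        if tag == "input" and (attr_map.get("type") or "").lower() == "text":
            self.found.append(attr_map)


def find_text_inputs(html: str):
    parser = TextInputFinder()
    parser.feed(html)
    parser.close()
    return parser.found


doc = '<form><input type="text" name="a"><input type="checkbox" name="b"><INPUT TYPE=text name=c></form>'
names = [attrs.get("name") for attrs in find_text_inputs(doc)]
print(names)
```

The parser copes with attribute order, quoting style and letter case, which are the details that make a pure regex for this brittle.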
Rundesigner • August 17th, 2011
Great article many thanks.
Rodrigo • September 25th, 2011

Hi!
Very nice article... I would like to know if you can help me. I need to verify an XML config file with regex, something like this:
For example, I need to verify that the Tomcat connector is on port 8080, so I will write a regex to find something like this: port="8080". So far, not a problem.
But how can I verify that this code is not commented out? Something like this:

This code is not used...
Could I remove all comments and analyze just the current code?
I need to use just regex, with no programming languages like C#, PHP, etc.
Thanks,
Rodrigo Maeda
Norman • October 9th, 2011
Thanks,
adjustment for attr with "-" (http-equiv=".....")
</?\w+((\s+[\w-]+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
In java:
String pattern = "<(\\w+)((\\s+[\\w-]+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
Norman • October 9th, 2011
New adjustment, for first attr with ":" (xmlns:html=".....")
</?\w+((\s+[\w-:]+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
In java (only open tags):
String pattern = "<(\\w+)((\\s+[\\w-:]+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
In java (all tags):
String pattern = "</?(\\w+)((\\s+[\\w-:]+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
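A portability note on the `[\w-:]` class: some engines read a hyphen after `\w` as the start of a range, and Python's `re` module reports a bad character range for it. Putting the hyphen last, as in `[\w:-]`, should be safe everywhere. Here is a rough Python sketch of the adjusted expression; the sample tags are taken from this thread and the article, and the names `TAG`, `OLD` and `results` are illustrative.

```python
import re

# Hyphen placed last in the class so it is always read as a literal.
TAG = re.compile(
    r"""</?\w+((\s+[\w:-]+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
)

# The article's original expression, which only allows \w in attribute names.
OLD = re.compile(
    r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
)

samples = [
    '<meta http-equiv="content-type" content="text/html; charset=UTF-8">',
    '<html xmlns:html="http://www.w3.org/1999/xhtml">',
    '<img title="displays >" src="big.gif">',
]

# True when the whole tag is matched from its first character.
results = [bool((m := TAG.match(s)) and m.group(0) == s) for s in samples]
print(results)
print(OLD.match(samples[0]))
```

The last line shows the original expression failing on the `http-equiv` tag, which is the case several commenters ran into.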
Tibor • November 21st, 2011
Hi, I have a similar task.
I want to convert to xml from html in oracle.
from:
<root>ff <img src="zzz"> it need <img src = "sdf >">bla bla <img title="displays >" src="big.gif">bla bla<style type="text/css">.text2 { font-size : 11px; font-family : verdana, arial; font-weight : normal></style></root>
to
<root>ff  it need bla bla bla bla<style type="text/css">.text2 { font-size : 11px; font-family : verdana, arial; font-weight : normal></style></root>
the second one is a standard xml format. I needn't any information about the image and ie. <br> tags.
But if I convert the
<style type ...> to  it will appear the mistake with the </style>
this example your suggestion result will:
<root>ff  it need bla bla bla bla.text2 { font-size : 11px; font-family : verdana, arial; font-weight : normal></style></root>
but that is not xml format.
can you suggest me something?
thanks,
Tibor
Manu • February 1st, 2012
ex:INPUT FILE
<html>
<head>
<body>Do it

this is information retrieval

<meta name="abcd"
content ="ebeg"
/meta>

</body>
</html>
OUTPUT FILE SHOULD BE :(WHAT I NEED TO WRITE REGULAR EXPRESSION FOR THIS???????)
Do it
this is information retrieval
"abcd"
"ebeg"
polas • March 6th, 2012
Hi,I want to find how to get embed like this from any website or google search vb.net regex code
<div style="" align=""><object type="application/x-shockwave-flash" data="http://" width="" height=""><param name="" value="">
embed></object>
<br />
anything.</div>
Gayathri Padmakumar • February 25th, 2013
i am a student and a beginner in python. Ur explanation is good

-gayathri
Abdul Basit • March 6th, 2013
This regex always returns null in JavaScript. Can anybody help me? Here is the code:

<html>
<head>
$("#<%= text2.ClientID %>").focusout(function () { alert($(this).val().match("/\s]+))?)+\s*|\s*)/?>/")); });
</head>
<body>
<asp:textbox id="text2" runat="server">
</body>
</html>
Andi • March 21st, 2013
Actually this was very useful to me ... I just needed a quick way of stripping out any HTML and only left with text in a text editor (such as VIM or Notepad++) and for that it works just fine ...
Fabiana • September 16th, 2013
Hi, first of all, sorry for my english.

I would like to use a regular expression to match multiline comments in C#. I have @"/[*][\w\d\s]+[*]/" but that expression only matches a comment between /* */ on a single line, not one that spans multiple lines.

Singleline:

/* xxxxxxxx */

Multiline:

/*
xxxxxxx
*/

I don't know if I could explain well, but any questions or if you can refer to somewhere that provides this information I would appreciate it.

Thank you very much.
Fabiana.
Dave • March 22nd, 2014
What about the text (non-HTML) phrase:

If BB & this is true
Havitoosh • May 28th, 2014
Unfortunately it will not match tags with enter inside an attribute, such as

<div title="y
yy">xxx</div>
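The cause is most likely that `.` in `".*?"` does not match a newline by default. Turning on the engine's "dot matches newline" option (Singleline in .NET, `re.DOTALL` in Python) or using the negated class `"[^"]*"` from the article should both get around it. A quick Python sketch of the flag approach, with illustrative variable names:

```python
import re

# The article's expression, unchanged.
PATTERN = r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""

tag = '<div title="y\nyy">xxx</div>'

# Default mode: "." stops at the newline, so the opening tag is not matched.
plain = re.match(PATTERN, tag)
print(plain)

# With DOTALL, "." also matches newlines and the whole opening tag matches.
dotall = re.match(PATTERN, tag, re.DOTALL)
print(repr(dotall.group(0)))
```

Note that `re.match` is used so that only a match starting at the opening tag counts; a plain search would still find the closing `</div>`.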
Rahul Vinod Sharma • October 27th, 2014
How can I select this whole text?

<h4>Example:</h4>

<iframe id="preview" style="height: 202px;" width="320" height="150" scrolling="no"></iframe>
<h4>Edit the code below & check the live preview above.</h4>
```
<textarea id="code" name="code"><!DOCTYPE HTML>
<html>
<head>
</head>
<body>
The <abbr title="HyperText Markup Language Help">HTML Help</abbr>Providing HTML5 help.
</body>
</html>
</textarea><script>// </script>
```
scottSEA • February 18th, 2015
I know this is older than dirt, but shame on you for concatenating all those strings in your source code. ;-)
sally • March 6th, 2015
tried this using php and preg_replace, comes up with error:

Warning: Unknown modifier '\' in /var/www/html.php on line 20

line 19: $reg = '\s]+))?)+\s*|\s*)/?>';

line 20: preg_match_all($reg, $html, $matches);

Having problems installing on http://www.filmrally.com/
pam • March 8th, 2015
dude that happens to me all.....the time.
Obsbi • January 15th, 2016
Can't wait for AI to generate this code automatically. We would only need to explain what we want, and it would do it.
Nick Jackson • May 22nd, 2016
Don't forget that attribute and tag names can contain "-", and "\w" does not match it.
Jay • January 20th, 2017
What if you come across a tag with a bunch of white space in it like this "< div />". Is there a way to modify the above regex to accommodate for this?
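Allowing optional whitespace right after the `<` (and around the closing slash) is a small change: add `\s*` in those spots. The Python sketch below does that to the article's expression. One caveat, stated from memory rather than checked against the spec: HTML5 parsers are generally understood to treat `<` followed by a space as plain text, so a strict parser would not see `< div />` as a tag at all. The name `LOOSE_TAG` and the samples are illustrative.

```python
import re

# The article's expression with \s* allowed after "<" and after the "/".
LOOSE_TAG = re.compile(
    r"""<\s*/?\s*\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
)

samples = ['< div />', '<div>', '< / div >', '<  img src="a.png" >']

# True when the whole string is matched as one tag.
matched = [bool((m := LOOSE_TAG.match(s)) and m.group(0) == s) for s in samples]
print(matched)
```

Loosening the pattern this way also raises the odds of false positives in ordinary text such as `a < b and c > d`, so it is a trade-off.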
breccs • July 21st, 2017
Hey, that looks good but i still have no idea how I can use regex combined with powershell to delete all instances of h1 tags including the text between the tags for example if my script finds <h1 id="This_must_go">This must go</h1>, delete it. I want to delete all the strings with h1 tag from my site because I will move it to one that will generate replacements for all h1 tags based on the file name. I know I am 13 years behind but if you can help it will be great.
Gerald Burkholder • November 14th, 2017
The current expression in this post won't detect the following tag:
<meta http-equiv="X-UA-Compatible" content="ie=edge">

I added a check for dashes in attribute names (\w|-)+

This gives the resulting expression:
<\/?\w+((\s+(\w|-)+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>
Look • December 12th, 2017
its fantastico...!!
Zakariae Filali • December 20th, 2017
Just want to say big thanks that fixed a big issue where I work ;)
Lau • August 23rd, 2019
Hello.

I’m looking for :

Find : Word To Replace : Word

Find ok with : (<\s\/?\sspan class=”bold”\s.?>)

But I don’t know for replace…
RevolveR • December 19th, 2019
PCRE has many issues and is used inside C or C++ … Sometimes we need to move some spaces to work around all of them. I found this when making a simple XSS prevention filter in PHP that cuts HTML content and closes opened tags. Sometimes a regex can be broken with the two symbols [ and (.

It’s like a curse of old mans wrinkles.
RevolveR CMF • December 19th, 2019
So, I can't find any regex that matches an omit tag when the tag is not strict, meaning it has no /> at the end, so I wrote a snippet that makes all omit tags strict. It works with my parse feature, which can extract nodes with their contents, children and attributes from an HTML fragment, like Node objects in JavaScript in a web browser.

Also it can extract attributes of omit-tag correct. You can improve it by checking entities and allowed attributes with contents like Validator but I skip this shit because all Entities of HTML 5 is hidden by W3C into browser and we don’t know about semantics and alignments.

Simple way is to make a whitelists for tags and attributes and filtrate all issues(also we don’t need to use spec for some contents like a XML db because we can now use in back-end store all words for the tags without entities check).

// make omit tags strict
preg_match_all('/<\/?(meta|img|br|hr|input)(.*?)([?!\/]?>)/mi', $s, $p, PREG_OFFSET_CAPTURE);

// walk the matches back to front so earlier offsets stay valid after each replacement
foreach ( array_reverse($p[ count($p) - 1 ]) as $omit) {

$s = substr_replace($s, '/>', $omit[1], strlen($omit[0]));

}