Using a Regular Expression to Match HTML


I just love regular expressions. I mean, look at the sample below.

</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>

What’s not to like?

OK, I admit, I was a bit intimidated by regular expressions when I first started off as a developer. All I needed was a Substring method and an IndexOf method and I was set. But after a few projects that required some intense text processing, I realized the power and utility of regular expressions. They should be on the tool belt of every developer. To that end, I recommend Mastering Regular Expressions by Jeffrey Friedl. This is really THE book on regular expressions. Reading it will make your Regex-Fu powerful.

So let’s look at a common task of matching HTML tags within the body of some text. When you initially think to parse an HTML tag, it seems quite easy. You might consider the following expression:

</?\w+\s+[^>]*>

Roughly translated, this expression looks for the beginning of the tag and the tag name, followed by some white-space and then anything that doesn’t end the tag.

Now this will probably work 99 times out of 100, but there’s a flaw in this expression. Do you see it? What if I asked you to match the following tag?

<img title="displays >" src="big.gif">

Hopefully you see the issue here. The expression will match

<img title="displays >

Unfortunately, this implementation is too naive. We have to consider the fact that the greater-than symbol does not end a tag if it’s within a quoted attribute value. Thus we must correctly match attributes.
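To see the failure concretely, here is a minimal sketch in Python rather than the post's .NET. The pattern is spelled out from the description above (tag opening, tag name, whitespace, then anything up to the first `>`), so treat its exact form as an assumption; `naive_match` is a hypothetical helper name.

```python
import re

# Naive tag pattern, assumed from the prose description:
# "<", optional "/", tag name, whitespace, any run of non-">" characters, ">".
NAIVE_TAG = re.compile(r'</?\w+\s+[^>]*>')


def naive_match(html):
    """Return the first naive tag match in html, or None if there is none."""
    m = NAIVE_TAG.search(html)
    return m.group(0) if m else None


# The ">" inside the quoted title ends the match too early.
print(naive_match('<img title="displays >" src="big.gif">'))
```

The match stops at the first `>` it meets, quoted or not, which is exactly the flaw described above.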

Now there are four possible formats for an HTML attribute:

name="double quoted value"
name='single quoted value'
name=notquotedvaluewithnowhitespace
name

Each of these cases is quite simple. In the first case, you could do the following:


The portion "[^"]*" matches a double quote, followed by any number of non-double-quote characters, followed by a double quote. Another way to express this is to use lazy evaluation as such:


The portion ".*?" uses lazy evaluation (the “lazy star”) to match as few characters as possible. For example, if we had a string like so

<a name="test" value="test2">

evaluating ".*" (aka greedy) would match

"test" value="test2"

However, lazy evaluation consumes the fewest characters that still match the expression, so the first match using ".*?" would be "test" and the second match would be "test2".
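The difference between the greedy and lazy forms is easy to check. The following is a small Python sketch (Python's `re` standing in for .NET's Regex; the quantifier syntax used here is the same in both), using a two-attribute tag with both values quoted.

```python
import re

tag = '<a name="test" value="test2">'

# Greedy: runs from the first quote to the last quote in the string.
greedy = re.findall(r'".*"', tag)

# Lazy: stops at the nearest closing quote, so each value is matched separately.
lazy = re.findall(r'".*?"', tag)

# Negated character class: gives the same result as the lazy form here.
negated = re.findall(r'"[^"]*"', tag)

print(greedy)
print(lazy)
print(negated)
```

Note that the matches include the surrounding quotes, since the quotes are part of the pattern.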

The full expression for matching an HTML tag is that lovely mash of characters presented at the very beginning of this post. It’s a modified version of the one presented in Friedl’s book.
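As a rough check, the sketch below runs the expression as it is quoted later in the comments (by Alex, comment 61) through Python's `re` module. Python is only a stand-in for .NET here: `re.DOTALL` plays the role of `RegexOptions.Singleline`, and `first_tag` is a hypothetical helper name.

```python
import re

# The tag expression as quoted in the comments, compiled so "." also matches newlines.
TAG = re.compile(
    r'''</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>''',
    re.DOTALL,
)


def first_tag(html):
    """Return the first tag matched in html, or None if nothing matches."""
    m = TAG.search(html)
    return m.group(0) if m else None


# The ">" inside the quoted attribute value no longer ends the match.
print(first_tag('<img title="displays >" src="big.gif">'))
```

Because each attribute value is matched as a unit, the quoted `>` is consumed as part of the title value and the match runs to the real end of the tag.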

However I wouldn’t recommend you just plunk that down in your code. Rather, you should consider adding it to a regular expression library assembly.

Don’t know how? Well I’ll show you a code listing for an exe that when run, builds a fully compiled version of this regular expression into an assembly that you can then reference in any project. In a later installment, I’ll explain in more detail just what the code is doing and how to use the compiled assembly. How irresponsible of me not to do that now. ;)

Source Listing




126 responses

  1. Dimitri Glazkov October 25th, 2004

    Look at you go, man. That's good stuff. After all of that partisan politics junk, it's refreshing to see a good tech post :)

  2. Haacked October 25th, 2004

    Ha ha ha... Thanks. At this point I think the two undecided people in this world will have to make up their own minds and not have me tell them what to think (though I think I would do a fine job of that).

    You should hopefully see some more good techie posts coming up.

  3. Pat October 25th, 2004

    Just to play devil's advocate for a minute, it seems like HTML is just too darned close to XML to have to parse this way. Isn't there a library out there for converting HTML into XHTML? If you can do that, you can just read the file in using XmlDocument::LoadXml(). Once you've done that, you can find your tags using an XPath query. Sorry, I just couldn't let a parsing post go by without tossing in my two cents ;)

  4. Haacked October 25th, 2004

    Pat, you're absolutely right. There is an SGML library as well as the HTML agility pack (

    However, we (RSS Bandit) found that SGML was too heavyweight, poor performing, and even a bit buggy for the simple task of searching HTML for links to RSS Feeds.

    I was tasked with replacing SGML with regular expressions and it performs quite well.

  5. Jon Galloway October 26th, 2004

    Great info. I've never really understood the lazy / greedy match thing before. Thanks.

    Using RegEx's on nested HTML gets more difficult. We messed with balancing groups, which are supposed to handle nested constructs, but we never got them working. Every sample (MSDN, Dan Appleman's book, etc.) was the same nested parentheses thing, and it didn't work for HTML.

    Did you handle nesting in your HTML parsing?

  6. Haacked October 26th, 2004

    Funny you mention this. I was just talking about this with a friend and on a newsgroup.

    Regular expressions just aren't well suited for nested matching. Balancing groups are a Microsoft innovation to regular expressions, so they're not something I've played around with much.

    Since the info I needed was inside a tag, my regular expression works fine for that type of processing. You could also use it to strip all tags from a document.

    If I were using it to actually parse an HTML doc (which I have some code that does), I'd keep track of indices: every time I match a tag, I record its beginning and end index. Then I compare those with the previous matched tag's indices and grab the content in between.
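The index-tracking idea can be sketched roughly as follows. This is Python rather than .NET, it uses the tag expression as quoted in comment 61, and `text_between_tags` is a hypothetical name; it is an illustration of the approach, not the code referred to above.

```python
import re

TAG = re.compile(
    r'''</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>''',
    re.DOTALL,
)


def text_between_tags(html):
    """Collect the non-empty text that sits between consecutive tag matches."""
    pieces = []
    previous_end = None  # end index of the previously matched tag
    for match in TAG.finditer(html):
        # Anything between the previous tag's end and this tag's start is content.
        if previous_end is not None and match.start() > previous_end:
            pieces.append(html[previous_end:match.start()])
        previous_end = match.end()
    return pieces


print(text_between_tags('<p>Hello <b>world</b></p>'))
```

This yields flat runs of text only. It knows nothing about which opening tag a closing tag belongs to, so it does not solve the nesting problem.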

  7. Simon Mourier October 27th, 2004

    What about comments (<!-- blablah -->)? Your regex matches links in comments too, right? Same remark with <script> and <style>?

    Simon (Just trying to be annoying . You know me :-)

  8. Haacked October 27th, 2004

    Oh! You're eeeevil! You are absolutely right. It would match tags within comments.

    It's easy enough to strip out comments before parsing.
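A minimal sketch of stripping comments before matching tags, in Python rather than .NET. The comment pattern is the lazy `<!--.*?-->` form that comes up later in this thread, and `tags_outside_comments` is a hypothetical helper name.

```python
import re

COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
TAG = re.compile(
    r'''</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>''',
    re.DOTALL,
)


def tags_outside_comments(html):
    """Remove HTML comments first, then return every tag that is left."""
    without_comments = COMMENT.sub('', html)
    return [m.group(0) for m in TAG.finditer(without_comments)]


print(tags_outside_comments('<!-- <a href="old.html"> --><p>hi</p>'))
```

This handles comments only. The contents of `<script>` and `<style>` blocks would need a similar pre-pass.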

  9. Haacked November 4th, 2004

    I found a mistake. For some reason my blogging engine capitalized some characters. Also, if a tag is on multiple lines, the expression above is broken. Here's my updated one.


  10. tester March 2nd, 2005

    <iframe src=""></iframe>some text

  11. Ash March 4th, 2005

    How can I use regex to match a HTML comment

    eg: <!-- blah blah blah -->


  12. Jim MacDiarmid April 4th, 2005


    I'm working on a template parsing engine and I was wondering how I would go about using Regex to capture text(html) between custom tags?


  13. Haacked April 4th, 2005

    Hi Jim,

    That's actually quite challenging. You could attempt to use .NET's balanced matching mechanism, but it's pretty difficult to get a grasp of and it's not standard Regex.

    It really depends what you're trying to accomplish. One way I've done it is to use a regex that matches html tags and then strip all tags out.

    Hope that helps.

  14. Adrian April 13th, 2005

    Can anyone help me out? I'm trying to build a regular expression to validate an SQL full text query. So something like:

    "cable television" and "wireless bluetooth" OR "3G"

    to ensure all boolean operators are outside quotes, and all text is in quotes.

    Is this possible with a regular expression? or the wrong approach?



  15. Haacked April 13th, 2005

    I'm not too familiar with SQL full text query syntax. How do you escape a double quote within the text?

    The naive approach is something like:


    This is basically matching an expression like:

    "some text"


    "some text" and "some more text"


    "some text" and "some more" or "even more"

    and so on.

    This expression doesn't handle escaped double quotes within the text.
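One way such a naive check could look, sketched in Python. The pattern below is an assumption built from the examples just listed (a quoted term, then any number of and/or plus quoted term pairs); it is not the expression originally posted, and `is_valid_query` is a hypothetical name.

```python
import re

# One quoted term, followed by zero or more "and"/"or" + quoted term pairs.
QUERY = re.compile(r'"[^"]*"(?:\s+(?:and|or)\s+"[^"]*")*', re.IGNORECASE)


def is_valid_query(text):
    """True if text is only quoted terms joined by boolean operators."""
    return QUERY.fullmatch(text.strip()) is not None


print(is_valid_query('"cable television" and "wireless bluetooth" OR "3G"'))
print(is_valid_query('cable television and "3G"'))
```

Like the naive approach described above, this does not handle escaped double quotes inside a term.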

  16. Don April 21st, 2005

    How can I get rid of a comma and a parenthesis inside a tag?


    <p(,)> this is the function calculate() </p(,)>

    after replace it should look like:

    <p> this is the function calculate() </p>

  17. Haacked April 22nd, 2005

    Are commas allowed within the attributes of a tag? For example:

    <p title="This has a comma, right here.">

    Should the regex strip that comma as well? If so it's much easier than if not.


  18. Don April 22nd, 2005

    yes if there is a comma inside < and > tag or < and /> it should be removed.


    right now i'm using [(),] which get rid of them even in the value tag.

    <p(,)> this is the function, calculate() </p(,)>

    after replace it should look like:

    <p> this is the function, calculate() </p>

  19. Haacked April 22nd, 2005

    Just to be clear, in my example above, you'd want the comma removed from the title attribute as well, right?

    If so, I'd just use the html expression to match tags


    Once you have the tag, replace "," with an empty string.

    Much easier to take a two-step approach on this one.

  20. Haacked April 22nd, 2005

    Ah, I see the problem. Commas aren't allowed in valid HTML, so my expression won't match it. However I have a code snippet that will. I haven't tested it super thoroughly, but it worked on some samples I threw at it. It won't remove commas within an attribute value.

    public string RemoveCommas(string html)
    {
        // Matches a tag, tolerating stray commas around its attributes.
        Regex regex = new Regex(@"</?\w+(((\s|\n|,)+\w+((\s|\n|,)*=(\s|\n|,)*(?:"".*?""|'.*?'|[^'"">\s]+))?)+(\s|\n|,)*|(\s|\n|,)*)/?>", RegexOptions.Singleline);
        int lastIndex = 0;
        StringBuilder result = new StringBuilder();
        MatchCollection matches = regex.Matches(html);
        foreach (Match match in matches)
        {
            // Copy the text before this tag unchanged.
            result.Append(html.Substring(lastIndex, match.Index - lastIndex));
            // Copy the tag itself with its commas removed.
            result.Append(match.Value.Replace(",", ""));
            lastIndex = match.Index + match.Value.Length;
        }
        // Copy whatever follows the last matched tag.
        result.Append(html.Substring(lastIndex));
        return result.ToString();
    }


  21. Don April 22nd, 2005

    I like the function method. Although it didn't replace any commas.

    But let me tell you what I really want. I'm reading an XML file line by line, and some XML tags have commas.

    So I want to delete those commas, but only in the tag name. If a comma is in the value, e.g. <something>bla bla, bla</something>

    it should be left intact.

    Here's the code I have right now:

    string theThingToReplaceWith = "";
    Regex exp = new Regex(@"[,]");
    try
    {
        while (myStreamReader.Peek() != -1)
        {
            theLine = myStreamReader.ReadLine();
            theLine = exp.Replace(theLine, theThingToReplaceWith);
        }
    }
    catch (EndOfStreamException eose)
    {
    }
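One possible way to limit the replacement to the tags themselves is the two-step approach suggested earlier in the thread: match each tag, then clean only the matched text. The sketch below is in Python rather than C#, uses a deliberately loose tag pattern (anything between `<` and `>`), and `clean_tags` is a hypothetical name.

```python
import re

# Loose tag pattern: "<", any run of characters that are not "<" or ">", then ">".
ANY_TAG = re.compile(r'<[^<>]+>')


def clean_tags(line):
    """Remove commas and parentheses inside tags, leaving text content alone."""
    return ANY_TAG.sub(lambda m: re.sub(r'[(),]', '', m.group(0)), line)


print(clean_tags('<p(,)> this is the function, calculate() </p(,)>'))
```

The loose pattern also strips these characters from attribute values, and it would be fooled by a `>` inside a quoted attribute, so it only suits simple name-only tags like the ones in the example.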




  22. Help Me May 20th, 2005

    How do I replace non-alphanumeric characters (other than spaces) between tags with a "-"?


    <Tag>Aapple's Great % Fruit</Tag>

    <Tag>Aapple's Bad # Fruit</Tag>

    should become

    <Tag>Aapple-s Great - Fruit</Tag>

    <Tag>Aapple-s Bad - Fruit</Tag>

  23. Mark Fletcher May 26th, 2005


    Good article! I was wondering if anyone can recommend a strategy for parsing HTML for a web spider? I want it to be able to match links, and find links in javascript. So far I've been using one expression to extract the links. However, with javascript code embedded in a file, I think I'd be better off running two passes -

    1) Grab the links in tags

    2) Make a pass for anything in javascript tags

    What do you think?

  24. Robin June 4th, 2005

    tried this using php and preg_replace, comes up with error:

    Warning: Unknown modifier '\' in /var/www/html.php on line 20

    line 19: $reg = '</?\W+((\S+\W+(\S*=\S*(?:".*?"|'."'".'.*?'."'".'|[^'."'".'">\s]+))?)+\s*|\s*)/?>';

    line 20: preg_match_all($reg, $html, $matches);

  25. haacked June 5th, 2005

    Hi Robin, this syntax is particular to .NET, so I'm not sure if you have to make some modifications for PHP.

    Also, the expression you're trying is not correct. The corrected expression is at the following URL (

  26. Reddy June 10th, 2005

    Please let me know the regular expression to detect one or more only white spaces. My requirement is I should raise error message if some body enters only white spaces in the Order name text box.

  27. Pasan June 11th, 2005

    This article is good.

    How can we use a regular expression to read text within specific html tags?

    Also, if we want to read text within the first paragraph tags of a web page, how can we use a regular expression to do that?

  28. Vadym May 11th, 2006

    Brilliant! But how about case like this:
    <td> Some weird <text> goes here </td> ? Expression above will treat <text> like tag and replace with "". To prevent this from happening I loop through the string and get anything that is not a tag but looks like tag (I have to match <text> to collection of valid tags). Save all occurrences in temp variables (something like #0001=’<text>’, #002=’<text 2>’ etc…). After I run regexp above to strip all HTML tags. And then loop again and replace all temp variables with corresponding values. Pain a bit ;-) So my output looks like:
    Some weird <text> goes here. Any better ideas?

  29. Haacked May 11th, 2006

    The problem is since <text> is not escaped, it probably won't get rendered by the browser, so stripping it shouldn't be a problem.
    Ideally, you wouldn't allow that sort of thing. You'd HTML encode it so it looks like &lt;text&gt;

  30. Kjeks May 31st, 2006

    Since the question came up: HTML comments are effectively matched with the regexp <!--.*?-->
    .*? provides non-greedy matching. Use single line option for good karma :-)

  31. rubendj June 5th, 2006

    <!--.*?--> doesn't work in cases like:
    <!--comments1--> visible text <!--comments2-->
    because it matches the entire line.
    This reg. exp. works better: <!--[^(-->)]*-->

  32. Kjeks June 14th, 2006

    Actually, <!--.*?--> works very well for the case you're describing.
    The '?' in '.*?' makes the match non-greedy, meaning that as few characters as possible will be matched. This works in Perl and Python, at least. Other languages may have another syntax.
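The disagreement above is easy to settle with a quick test. Here is a small Python sketch comparing the lazy pattern with its greedy counterpart on the line in question; only the greedy form swallows the whole line.

```python
import re

line = '<!--comments1--> visible text <!--comments2-->'

# Lazy: each comment is matched on its own.
lazy = re.findall(r'<!--.*?-->', line)

# Greedy (no "?"): one match from the first "<!--" to the last "-->".
greedy = re.findall(r'<!--.*-->', line)

print(lazy)
print(greedy)
```

For comments that span several lines, the single line option mentioned earlier (`re.DOTALL` in Python) is needed as well, since `.` does not match a newline by default.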

  33. Freddy July 6th, 2006

    Here is a regex that will accept form input, as long as it doesn't contain any HTML or JSP comment tags:
    ^(?:(?!(!|%)--[\s\S]*?--[ %\t\n\r]*>).)*$
    It works on the client-side (JavaScript RegEx engine) but fails on the server side with an "Invalid RegEx" error.
    Does anyone know how to write the expression so that it will validate against the XML specification for RegEx?
    More details follow:
    I need a regex that will reject form data that contains HTML or JSP comments.
    I am constrained to entering the pattern into a proprietary XML editor.
    I don't have the option of using substitution or NOT operators - only the expression itself.
    Page 198 in the Perl Cookbook provides the basic format for a NOT regex:
    And the RegExLib provided the HTML comment pattern that I want to dis-allow:
    <!--[\s\S]*?--[ \t\n\r]*>
    My SAX parser won't allow a less-than character in the expression, so I will leave it out.
    I also want to dis-allow JSP comments, so I added a % as an alternate to the ! at the beginning and as a member of the character set at the end.
    ^(?:(?!(!|%)--[\s\S]*?--[ %\t\n\r]*>).)*$
    It works fine on the client-side, but fails on the server-side.
    How do I write an equivalent regex that will validate against the XML spec?

  34. Shawn August 8th, 2006

    ^(?:(?!(!|%)[\s\S]*?--[ %\t\n\r]*>).)*$
    *should* also find the CDATA used in XML to escape HTML characters. This is untested, but it builds off
    <![\s\S]*?--[ \t\n\r]*>
    which is what I use (fully tested on thousands of pages, but I'm not looking for JSP comments)

  35. John September 10th, 2006

    You are a RegEx genius!
    Is it possible to enhance the expression to get a reference to the inner html if present, something like (?<innerHtml>)
    For example, for the value:
    <title>My Title</title>
    I could retrieve the value "My Title"
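A named group can capture the inner content, at least for elements that do not nest inside themselves. The sketch below is Python, where the syntax is `(?P<innerHtml>...)` rather than .NET's `(?<innerHtml>...)`. The pattern is a simplified stand-in and not an extension of the post's expression, and `inner_html` is a hypothetical helper name.

```python
import re


def inner_html(html, tag_name):
    """Return the content of the first <tag_name>...</tag_name> pair, or None."""
    name = re.escape(tag_name)
    # Opening tag with optional attributes, lazy named capture, matching closing tag.
    pattern = r'<{0}\b[^>]*>(?P<innerHtml>.*?)</{0}\s*>'.format(name)
    m = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
    return m.group('innerHtml') if m else None


print(inner_html('<html><head><title>My Title</title></head></html>', 'title'))
```

Because the capture is lazy, it stops at the first closing tag of that name, so it gives the wrong answer for elements such as nested `<div>`s.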

  36. JMARIN October 24th, 2006

    does it works in ASP.NET HTML Tag like:

  37. santhosh November 19th, 2006

    I want to write a regular expression for commenting out all the script tags in a page which is not present inside the comment block.
    For ex:

    some text here.....
    <script type="text/javascript">
    alert("Hi santhosh");
    some text here......

    I should comment the script section above. The output should be

    some text here.....
    <!--script type="text/javascript">
    alert("Hi santhosh");
    some text here......

    2) But i should not comment for the following scenario. The portion of code is already inside the comment.

    some text here.....
    some text here.....
    <script type="text/javascript">
    alert("Hi santhosh");
    some text here......
    some text here......

    I have written the regex for commenting all the script tags.
    The following is the regex i used
    Please help me.....

  38. Haacked November 19th, 2006

    Two options I can think of.
    1. Before processing, remove commented out scripts. For example, replace
    "<!--<script>" with "<script>"
    and "</script>-->" with "</script>".
    Then run your original replacement.
    2. Use negative lookahead and negative lookbehind to make sure the script you're replacing isn't already commented out.
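The lookbehind option might look roughly like this in Python. It assumes the comment marker sits directly before the tag, as in `<!--<script>`, because a lookbehind in Python must have a fixed width; `uncommented_script_starts` is a hypothetical name.

```python
import re

# "<script" that is not immediately preceded by "<!--".
UNCOMMENTED_SCRIPT = re.compile(r'(?<!<!--)<script\b', re.IGNORECASE)


def uncommented_script_starts(html):
    """Return the start index of every <script tag not directly preceded by <!--."""
    return [m.start() for m in UNCOMMENTED_SCRIPT.finditer(html)]


print(uncommented_script_starts('<!--<script>a</script>--><script>b</script>'))
```

If whitespace or other text can appear between the comment opener and the script tag, this check is not enough and the first option (normalizing the commented scripts beforehand) is the safer route.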

  39. Milenko Curcin December 7th, 2006

    I'm trying to make a good regexp for finding html tags so that i could colour them (program for colouring code) and there is one problem with this regexp, it will match <br> inside <pre> like for example here
    <pre>some text<br>some more text</pre>
    and this shouldn't happen, <br> is not a tag anymore, it is a text. my knowledge of regexp is not big and i don't know how to fix this :(

  40. Ashly December 19th, 2006

    I am working with PHP 5.0
    I have a string like this:

    function getSize($userIDArr)
    $arrSize = sizeof($userIDArr);
    function getUserNames($userIDArr)
    for($i=0; $i < sizeof($userIDArr); ++$i)
    $userNamesArr[$i] = $userIDArr[$i];
    return $userNamesArr;

    I need to replace the outer tag [code] [/code] pair with <abc> </abc>

    The result should look like:

    function getSize($userIDArr)
    $arrSize = sizeof($userIDArr);
    function getUserNames($userIDArr)
    for($i=0; $i < sizeof($userIDArr); ++$i)
    $userNamesArr[$i] = $userIDArr[$i];
    return $userNamesArr;

    If anyone have any idea, please help me..
    Thanks in advance

  41. Ashly December 19th, 2006

    My email id is:

  42. SteveLionbird December 28th, 2006


    .*? provides non-greedy matching. Use single line option for good karma :-)

    Excellent tip .. I'm a CF developer and could not find that documented. CF does not support lookbehinds so I was pulling my hair out trying to grab content between matching opening and closing tags accurately.

  43. Nokturnal January 16th, 2007

    Can anyone port this over to work within javascript?
    Cheers and thanks for the great tutorial here!

  44. Nokturnal January 16th, 2007

    Oh man, forget what I typed above. The error was in the actual implementation of the javascript itself.
    Cheers and sorry to waste your time :)

  45. lb April 11th, 2007

    >You’ll have to use the SingleLine RegexOption for it to work
    i had this same issue with a regex yesterday... how counter-intuitive is it that you use the singleline option to match a pattern that spans multiple lines?
    i guess their thinking is that the singleline option means "treat this input as a single line" -- but that's not how we think when trying to get a regex to work. We think more along the lines of 'i want my pattern to match, even when it spans multiple-lines'
    you know i wish i'd found this blog post yesterday... i had many of the same issues: the need for non-greedy matches, the need for the single line option... but my regex-fu is weak. i got there in the end.

  46. Robert April 19th, 2007

    I am versed in regexs so here is an idea I am testing for php. It matches doctype, open, close, and comment tags. It isn't perfect but it gets closer to html rules. This fixes the bug where a tag could be </div /> and is multi-line compatible.
    preg version:

    php version:
    $pattern = "/(".
    @preg_match_all($pattern, $filedata, $matchesArray);

  47. Chris April 22nd, 2007

    Hi Folks,
    I'm trying to get plain text out of "any" html document, which seems to be quite difficult to do cause most of the regex I tried doesn't match scripts, styles and so on. Any suggestions to do it in a different way, I'm programming in C# .NET
    Thanks in advance and sorry for my poor english

  48. Conrad de Wet April 24th, 2007

    Been searching around to find Javascript that will convert <TAG attrib=xyz> into <TAG attrib="xyz">
    I would imagine this code you have developed is close im just not sure how to use it.
    Reason: Using a CMS for editing and creating HTML with the output as XML to flash. The textField.htmlText does not support unquoted attributes.
    Any assistance appreciated.

  49. Haacked April 24th, 2007

    The easiest way to do this is to download the Subtext Source Code and do a search for the method ConvertHtmlToXHtml.
    You'll need to add a reference to SgmlReaderDll.dll which is included in the source.

  50. anon May 4th, 2007

    Great regex, but just to note it doesn't work with attributes that do not have quotes:
    [div onclick=alert("hacked") ]
    I've been messing with the regex as it is and there doesn't seem to be an easy way to support all three styles.
    any ideas?

  51. Carlos May 28th, 2007

    How can I retrieve only the text that is between the tags <title> and </title>?

  52. NeoGeo June 4th, 2007

    Can someone help me? I would like to search for script tags. I would like to have it check either the opening tag or closing tag or both. I just need to search if the string have a script tag.
    Thank you in advance.

  53. Shail June 14th, 2007

    I am not strong in regular express. I am stuck while validation one string. Actually I want test or test or test or any string. Can you help me to generate regex for this validation. Thaxs

  54. Brad July 9th, 2007

    I don't see any use in this regex.. I'm trying to do something more advanced that I haven't seen done yet.
    I'm matching HTML tags too.. but the whole tag, not just the start tag. I want the start and the end tag. I haven't been able to figure out how to do the recursive part, only a fixed number of levels, like 10.
    Meaning I want to match the whole outer tag of..
    <div id="a"><div><div></div></div></div>
    So I search for the tag with id="a" and it returns the whole contents, even the nested tags. I'm pretty much done, I just can't figure out the infinitely nested duplicate tags. GLGL.

  55. Sagar July 11th, 2007

    I am trying to super-optimize an html output (getting rid of 2 or more spaces in the response ) but failed to construct a suitable regex, can any body help?

  56. Luke-Jr September 6th, 2007

    I realize I'm a bit late seeing this post, but....
    &lt;img title="displays &gt;" src="big.gif"&gt;
    happens to be INVALID HTML in the first place. &gt; is not allowed even in attributes!
    The correct code, which does NOT break your original regex is:
    &lt;img title="displays &amp;gt;" src="big.gif"&gt;

  57. cioman October 10th, 2007
  58. cioman October 10th, 2007

    Sorry! Re-posting, because the previous post didn't show up well:

    can strip an html file off all tags.
    Can we prevent it from searching the following tags:
    a, i, b, p, sup and their closing tag equivalents (i have removed <> as the tags weren't displaying when I posted the query earlier)
    The application of such a regex would be to preserve all the tags that make a structure of the text, with minimal formatting.
    What would be better if one could replace 'i' with 'em' and 'b' with 'strong' (replace '' with <> please, I was having problems posting this query to the forum).
    Any ideas??
    Thanks, in advance.
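One way to strip every tag except a short allow-list is a negative lookahead on the tag name. The sketch below is Python with a simplified tag pattern (it stops at the first `>`), not the expression from the post; `ALLOWED` and `strip_other_tags` are hypothetical names.

```python
import re

ALLOWED = ('a', 'i', 'b', 'p', 'sup')

# Any opening or closing tag whose name is NOT one of the allowed names.
DISALLOWED_TAG = re.compile(
    r'</?(?!(?:' + '|'.join(ALLOWED) + r')\b)\w+[^>]*>',
    re.IGNORECASE,
)


def strip_other_tags(html):
    """Remove every tag except the allowed ones, keeping all text."""
    return DISALLOWED_TAG.sub('', html)


print(strip_other_tags('<div><p>Hi <b>x</b> <span>y</span></p></div>'))
```

The `\b` after the name list matters: without it, `<pre>` or `<br>` would be mistaken for `<p>` or `<b>` and kept. Renaming `i` to `em` and `b` to `strong` could be done as a separate substitution pass afterwards.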

  59. Jesse Morrow October 15th, 2007

    I allow the user to submit customized header and footer HTML through a web form. While I'm not too concerned with what they do with the HTML or even if it contains valid tags it is important that their tags be well formed and balanced (i.e. properly matched with ending tags) so that their potentially bad HTML doesn't cause the rest of the web site to get trampled.
    I made a Javascript function which takes an HTML snippet as a string and returns true if the HTML is well formed and all tags are properly balanced.
    I took the regular expression given here and extended it to match:
    1) any opening tag, its *text* content, and its corresponding closing tag,
    2) any self-closing tag - such as <br />, <input />,
    3) any HTML comments, or
    4) pure text (i.e. no HTML tags)
    The trick is taking into account the nested nature of HTML which regular expressions aren't expressive enough to match. My trick is to iteratively replace each matched portion with nothing - thus stripping it from the HTML string until it has been stripped down to an empty string. If the loop can strip the HTML iteratively to an empty string then it must be valid and all tags balanced. If the loop hits a point where nothing new is being matched and stripped and yet the string is still not empty then the HTML is invalid or unbalanced.
    The reason this works is because each loop iteration strips off all the most deeply nested elements which have no child elements (leaf elements) thus leaving their parent elements as leaf elements for the next iteration.
    Here is the code:
    var regex = /[^<>]*<(\w+)(?:(?:\s+\w+(?:\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>[^<>]*<\/\1+\s*>[^<>]*|[^<>]*<\w+(?:(?:\s+\w+(?:\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/>[^<>]*|<!--.*?-->|^[^<>]+$/ig;
    function validate(html) {
        var v = html;
        do {
            html = v;
            v = html.replace(regex, '');
        } while (v != html);
        return v.length == 0;
    }
    The loop structure and iterative concept is totally stable. The only thing which might need some refinement is the regular expression as I haven't thought too deeply about all the comment and line return possibilities.

  60. Jason October 26th, 2007

    This is in regards to this comment.
    <html xmlns="" xml:lang="en" lang="en" id="something">
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    Those tags don't get matched because of the non \w characters ":" and "-".
    It should be changed to...
    $pattern = "/(".

  61. Alex November 6th, 2007

    First off, thanks for the original regular expression, </?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>. It took me seconds, maybe a minute to find what I needed doing a google search. It took me much longer to get this working in my application using JavaScript. Since it was a PITA for me, I am going to show how I did it, because I would have been thankful to find this.
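    function ContainsHtml(inputText)
    {
        var htmlRegex = new RegExp(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gim);
        // Assumed condition; the original test line did not survive.
        if (htmlRegex.test(inputText))
        {
            return true;
        }
        return false;
    }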
    The difficult part was setting this string </?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?> in JavaScript with quotation marks and forward and backslashes. I started off by surrounding the initial regex string with quotation marks, then escaping the appropriate characters, but this did not work.
    var re = "</?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)/?>";
    After some time, I gave up and went with this approach, which worked.
    new RegExp(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gim);
    Basically, I was escaping the wrong characters. Hope this helps, and thanks again for the regex Phil.

  62. sums November 15th, 2007

    I was reading through your posts.
    Can you please help me with this problem:(Which is pretty simple but I do not know regular expressions)
    I need to define a rule where anypage.aspx converts to anypage (to define in web.config)
    ex: or etc. beocomes and respectively.

    Looking forward to hearing from you soon.

  63. Jinesh Shah February 21st, 2008

    Hi ALl...i got a nice script here...Bt stilli hv some problem..i m explaining here...
    i hv string like :
    string htmlstring ="<form name="f1"> <input type="text" name="t1"><input type="text" name="t2"><input type="submit" name="submit"></form>"
    now Using Regular Expression i want value of name attribute not can anybody give me dat regular expression..
    ans must be : t1 t2
    not f1 or submit ...
    i hv tried lots of expression bt still cant find the final one....
    Thank u
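For the specific markup in this question, a pattern along these lines picks out the names of the text inputs only. It is a Python sketch that assumes the `type` attribute comes before `name` and that both values are double-quoted, as in the sample string; `text_input_names` is a hypothetical helper name.

```python
import re

# An <input ...> whose type is "text", capturing the value of its name attribute.
TEXT_INPUT_NAME = re.compile(
    r'<input\b[^>]*\btype="text"[^>]*\bname="([^"]*)"',
    re.IGNORECASE,
)


def text_input_names(html):
    """Return the name of every <input type="text"> in html."""
    return TEXT_INPUT_NAME.findall(html)


form = ('<form name="f1"> <input type="text" name="t1">'
        '<input type="text" name="t2">'
        '<input type="submit" name="submit"></form>')
print(text_input_names(form))
```

Markup that puts `name` before `type`, or that uses single or no quotes, would need a more tolerant pattern or a real HTML parser.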

  65. mj April 24th, 2008

    hi i am doing an assignment and need help!!!!
    i am parsing html using and need help finding img, object, and applet tags using regular expressions. please help!!!

  66. Mic May 1st, 2008

    I am using preg_split to split words...and then trying to match using
    preg_grep, how do i match html tags?

  67. tim May 11th, 2008

    I'm lazy and had been using
    to match tags. It works pretty well for me since I don't care what's in the tag itself like attributes, but fails on self closing tags. This is a problem for another day.
    I thought that I could match everything except the list by adding a "not" operator (^) inside those brackets:
    but it doesn't work. Somebody has made off with my regular expression reference book and the "tutorials" I'm reading online are leaving me more confused than if I just brute force it myself.
    Any assistance would be appreciated

  68. Carros DF May 20th, 2008

    <!--[\s\S]*?--[ \t\n\r]*> work nice on DreamWeaver CS to remove comments via SEARCH AND REPLACE box.

  69. Alexander Thorell June 16th, 2008

    Hi and thanks Haack, but your expression \s]+))?)+\s*|\s*)/?> does not seem to match tags with attribute names containing a hyphen (-), like <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  70. Avatar for Alexander Thorell
    Alexander Thorell June 16th, 2008

    Sorry, the expression got truncated; I'm referring to the updated expression in your earlier comment. See if it works now...

  71. Avatar for Frank Dase
    Frank Dase June 26th, 2008

    I need an expression for a search engine to highlight the word I searched for, but I have to ignore matches inside HTML tags.
    For example:
    <base href="&lt;a rel=" nofollow="" external"="" href="" title="">">this is a laser test...
    I want to match "laser" only outside the base tag.
    I'm a noob at regular expressions, so I hope you can help me. I need it for classic ASP.

  72. Avatar for wow.. your expression is good
    wow.. your expression is good July 1st, 2008

    Thank you~^^

  73. Avatar for moo
    moo July 1st, 2008

    While RegExs are good for a variety of purposes, people really need to think about using the HTML DOM for most of their processing needs. If you're just trying to get elements from inside tags or information, using the DOM tends to be easier and more extensible.
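    As a rough illustration of the parser-based approach, here is a minimal Python sketch that uses the standard library's `html.parser` (a tag-level parser rather than a full DOM) to answer an earlier question in this thread: collect the `name` of every text input. The class and function names are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextInputNames(HTMLParser):
    """Collect the name attribute of every <input type="text"> element."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "text":
            self.names.append(attributes.get("name"))


def text_input_names(html):
    parser = TextInputNames()
    parser.feed(html)
    parser.close()
    return parser.names


sample = ('<form name="f1"><input type="text" name="t1">'
          '<input type="text" name="t2"><input type="submit" name="submit"></form>')
print(text_input_names(sample))  # ['t1', 't2']
```

    Because the parser handles quoting and attribute order itself, the filter on `type` stays a plain dictionary lookup instead of another regex alternative.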

  74. Avatar for Skyetech
    Skyetech July 21st, 2008

    Frank D,
    I'm trying to do exactly what you're doing with the search engine results. Did you get a solution to your problem?

  75. Avatar for Jamey Taylor
    Jamey Taylor July 22nd, 2008

    Found a case it doesn't handle:
    It doesn't match if the attribute name has a non-word character. For example, the hyphen in the first attribute below causes it to not match:
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    This has been a huge help to me in maintaining hundreds of non-xhtml compliant pages. Thanks!!!

  76. Avatar for yuree
    yuree July 22nd, 2008

    I want to strip non digits and first digit, any one can help?

    I use "/[^\d]/g" to remove non digits, but I want to remove first digit too, e.g.


    by using "/[^\d]/g", i see 823185, but i want to trim first digit.

    I just want to display "23185", any clue?
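    One way to get that result is to strip the non-digits first and then slice off the leading digit. This is a minimal Python sketch; the input string is hypothetical, since the original sample did not survive in the comment.

```python
import re


def digits_without_first(text):
    """Drop every non-digit character, then drop the first remaining digit."""
    digits = re.sub(r"\D", "", text)
    return digits[1:]


# Hypothetical input that reduces to 823185 once non-digits are removed.
print(digits_without_first("8-231-85"))  # 23185
```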

  77. Avatar for Craig Laparo
    Craig Laparo August 3rd, 2008

    I'm a student in the Google Summer of Code program, working for Dojo, and I came across your awesome regular expression for catching HTML tags. I'd like to use it in my code, if it's ok with you. Do you have a CLA?

  78. Avatar for tp
    tp September 21st, 2008

    Hi Friends,

    EDITOR: Comment Removed because of HTML formatting issues. Sorry.

  79. Avatar for Volkan Vardar
    Volkan Vardar October 8th, 2008

    should do the same...

  80. Avatar for Pencheff
    Pencheff October 9th, 2008

    For those also working with Delphi and TRegExpr library, here's my Regex for parsing HTML tags (it's based on Phil's one):
    RegexParser.Expression := '(?i)<(/?\w+)((\s+(\w+)(\s*=\s*("(.*?)"|[^''">\s]+))?)+\s*|\s*)/?>';
    It does return the tag (b, /b, font, /font, etc) in Match[1]
    the parameter (size, color, etc) in Match[4] and
    the parameter value in Match[7]
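    For readers who want to check the group numbering, here is a minimal Python sketch that ports the same pattern (the doubled `''` in the Delphi string is a single quote). It is an illustration, not the commenter's code. Note that in Python's `re` module a group inside a repeated group keeps only its last iteration, so groups 4 and 7 report the last attribute of the tag.

```python
import re

# Same pattern as the Delphi expression above, written as a Python raw string.
TAG = re.compile(
    r"""(?i)<(/?\w+)((\s+(\w+)(\s*=\s*("(.*?)"|[^'">\s]+))?)+\s*|\s*)/?>"""
)

opening = TAG.search('<font size="3" color="red">')
print(opening.group(1))  # font
print(opening.group(4))  # color (last attribute name)
print(opening.group(7))  # red (last double-quoted value)

closing = TAG.search("</b>")
print(closing.group(1))  # /b
print(closing.group(4))  # None, because the tag has no attributes
```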

  81. Avatar for sfsdfs
    sfsdfs January 7th, 2009


  82. Avatar for smith
    smith January 10th, 2009
    is better good.

  83. Avatar for Joe Bob
    Joe Bob January 10th, 2009

    HTML is not regular and hence can't be reliably parsed by regular expressions. Please see the following site for more information:

  84. Avatar for robert
    robert March 3rd, 2009

    i need one regular expression that it show words using code html

  85. Avatar for Jamp Mark
    Jamp Mark March 5th, 2009

    Here is a regex pattern to capture the URL in anchor tag HREF.

  86. Avatar for Mo
    Mo March 8th, 2009

    I need to match double quotes within a string, when they are not within HTML tags. What is the pattern to match "up" but not "highlight"?
    Hello <span class="highlight">Peter</span>, what's "up"?
    I'm using a VB-RegEx-Engine...

  87. Avatar for Andrei
    Andrei March 10th, 2009

    I am trying to match and replace links in anchor tags. It works with /<a\s+href="([^>]+?)"/ but I need to replace only the URL inside the href, not the whole anchor element.
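    A common way to replace only part of a match is to capture the surrounding text in its own groups and write it back unchanged. This is a minimal Python sketch under the same assumptions as the pattern above (`href` is the first attribute and is double-quoted); the names `HREF` and `rewrite_hrefs` and the example hosts are made up.

```python
import re

# Group 1 keeps the text before the URL, group 2 is the URL itself,
# and group 3 keeps the closing quote.
HREF = re.compile(r'(<a\s+href=")([^">]+)(")')


def rewrite_hrefs(html, rewrite):
    """Apply rewrite() to each URL and leave the rest of the tag untouched."""
    return HREF.sub(lambda m: m.group(1) + rewrite(m.group(2)) + m.group(3), html)


sample = '<a href="http://old.example/page">link</a>'
print(rewrite_hrefs(sample, lambda url: url.replace("old.example", "new.example")))
# <a href="http://new.example/page">link</a>
```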

  88. Avatar for Vaibhav
    Vaibhav March 11th, 2009

    i need a regular expression for removing self closing html tag

  89. Avatar for get entertainment
    get entertainment March 18th, 2009

    How do you get the inner text using regex? what if there is several matching tags i.e. <ul><li>content</li><li>content</li>etc. </ul>

  90. Avatar for Black
    Black March 27th, 2009

    I want to get the content between this text, but I don't know how to get it with a regular expression.
    So, how can I get the content between the helloworld tags with a regular expression? There are a lot of line breaks between them,
    and the text I have is too big and long, so I can't use a replace function.

  91. Avatar for Steve Tattersall
    Steve Tattersall March 21st, 2010

    I want to remove lines of HTML code from the start tag <DOCTYPE ...> through the end tag </table>. How can I best remove multiple lines of HTML code?

  92. Avatar for Pete B
    Pete B March 21st, 2010

    Just spent the last hour trying to do this and failing:


  93. Avatar for Wize
    Wize March 27th, 2010

    I am trying to write a regular expression that will replace text but not inside of double quotes; and if there aren't any quotes, then it replaces the text
    Example: abc "abc" abc
    Replace with: def
    Expression: (?!")abc(?!")
    Result wanted: def "abc" def
    Result got: def "abc" def
    However, if I have: def "z abc z" def
    I get: def "z def z" def
    Instead of: def "z abc z" def
    using: (?!")abc(?!")
    I tried: (?!"\w*)abc(?!\w*")
    but got: def "z def z" def
    I am hoping to write an expression that will change:
    abc abc (to) def def
    abc "abc" abc (to) def "abc" def
    abc "z abc z" abc (to) def "z abc z" def
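    Lookarounds such as (?!") only inspect the neighbouring character, which is why the "z abc z" case slips through. A common alternative is to match quoted strings first and hand them back unchanged. The following is a minimal Python sketch of that idea, with made-up names, assuming quotes are balanced and never escaped.

```python
import re

# Try a whole double-quoted string first; only text outside quotes
# can then reach the second alternative.
QUOTED_OR_TARGET = re.compile(r'"[^"]*"|abc')


def replace_outside_quotes(text, replacement="def"):
    def pick(match):
        token = match.group(0)
        # Quoted strings are returned unchanged; bare matches are replaced.
        return token if token.startswith('"') else replacement

    return QUOTED_OR_TARGET.sub(pick, text)


print(replace_outside_quotes('abc abc'))            # def def
print(replace_outside_quotes('abc "abc" abc'))      # def "abc" def
print(replace_outside_quotes('abc "z abc z" abc'))  # def "z abc z" def
```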

  94. Avatar for Tass
    Tass April 8th, 2010

    I have some HTML and I have to replace width:0;height:0;" with width:0px;height:0px;" in the style attribute.
    Does anyone have an idea what the regex and the replacement could be?

    <div class="cool"><img src="" alt="CoolChaser"></div><img style="visibility:hidden;width:0;height:0;" border=0 width=0 height=0 src="*xMjE2ODc*NjA*MDYyJnA9MjEwNjkxJmQ9Jm49bXlzcGFjZSZnPTE=.jpg" />

  95. Avatar for Anon
    Anon July 9th, 2010

  96. Avatar for Flavio Troja
    Flavio Troja July 14th, 2010

    I need a regular expression that captures the HTML code between the tags <div class="result">
    and <br clear="all">
    like this:
    <div class="result">
    .... (I want to capture this)
    <br clear="all">

    Can you help me?

  97. Avatar for celso
    celso July 21st, 2010

    How can I replaceAll myTerm only outside of a <tag>, like this:
    myTerm is <x='z myTerm w'> like myTerm
    XXXXX is <x='z myTerm w'> like XXXXX

  98. Avatar for Martin Radev
    Martin Radev September 5th, 2010

    Here are some regular expressions from me:
    /< *img[^>]* src *= *["\']?([^"\']*)/is - img tag
    /\< *meta[^>]*charset *= *["\']?([^"\']*)/i - encoding
    /\<meta name="description" content *= *["\']?([^"\']*)/i - description
    /<title> *(.*) *<\/title>/is - title
    If you see something wrong, you could comment on it here :) 10x

  99. Avatar for Cthulhu
    Cthulhu November 13th, 2010

    For the love of all that is sane and good in the world, please delete this blog entry and redirect to an article about xpath. Sure, you can get away with a regular expression here and there for html & xml, but when you write about it in a blog there are people who inevitably try to "improve" upon that "little" hack and end up wandering down the path to madness by thinking that it is possible to parse html with regular expressions-- which, of course, is NOT possible.

  100. Avatar for Darko
    Darko November 15th, 2010

    I'm trying to match a word inside a link word ... Can you give me a hint?
    Thank you!

  101. Avatar for SuRGeoN
    SuRGeoN January 6th, 2011

    just wrote a Regex in vb .net for following tags:
    <tag_name varZ* varX=valueY*>
    this regex will also include correctly tags as:
    <tag_name varX="value >">
    VB .NET Regex (HTML Tags)
    Dim regex_tag As String = "(?<extract><[a-zA-Z]+\s*(.|\n)*?>)([^<>]*(?=<)|[^<>]*$)"
    Hope you will find it useful

  102. Avatar for rr
    rr February 28th, 2011

    How can I match all the HTML input tags, but only those whose type is text?

  103. Avatar for Rundesigner
    Rundesigner August 17th, 2011

    Great article many thanks.

  104. Avatar for Rodrigo
    Rodrigo September 25th, 2011

    Very nice article... I would like to know if you can help me. I need to verify an XML config file with regex, something like this:
    for example, I need to verify that the Tomcat connector is on port 8080, so I will write a regular expression to find something like this: port="8080". So far, not a problem.
    But how can I verify that this code is not commented out? Something like this:
    <Connector port="8080" .....
    This code is not used...
    Could I remove all comments and analyze just the active code?
    I need to use just regex, with no programming in C#, PHP, etc.
    Rodrigo Maeda

  105. Avatar for Norman
    Norman October 9th, 2011

    adjustment for attr with "-" (http-equiv=".....")
    In java:
    String pattern = "<(\\w+)((\\s+[\\w-]+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";

  106. Avatar for Norman
    Norman October 9th, 2011

    New adjustment, for first attr with ":" (xmlns:html=".....")
    In java (only open tags):
    String pattern = "<(\\w+)((\\s+[\\w-:]+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
    In java (all tags):
    String pattern = "</?(\\w+)((\\s+[\\w-:]+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
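    To see what the adjustment buys, here is a minimal Python sketch that builds the pattern twice, once with the original `\w` attribute names and once with the extended class. It is a port for illustration, not the commenter's Java; the helper `tag_pattern` is made up. The class is written `[\w:-]`, with the hyphen last, because Python's `re` rejects `[\w-:]` as a bad character range.

```python
import re

# Attribute value: double-quoted, single-quoted, or unquoted.
ATTR_VALUE = r"""(?:".*?"|'.*?'|[^'">\s]+)"""


def tag_pattern(name_chars):
    """Build the tag regex with the given class for attribute-name characters."""
    return re.compile(
        r"</?(\w+)((\s+" + name_chars + r"+(\s*=\s*" + ATTR_VALUE + r")?)+\s*|\s*)/?>"
    )


WORD_ONLY = tag_pattern(r"\w")      # attribute names limited to \w
EXTENDED = tag_pattern(r"[\w:-]")   # also allows ":" and "-"

meta = '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
xmlns = '<html xmlns:html="http://www.w3.org/1999/xhtml">'

print(WORD_ONLY.search(meta))               # None
print(EXTENDED.search(meta).group(0) == meta)    # True
print(EXTENDED.search(xmlns).group(0) == xmlns)  # True
```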

  107. Avatar for Tibor
    Tibor November 21st, 2011

    Hi, I have a similar task.
    I want to convert to xml from html in oracle.
    <root>ff <img src="zzz"> it need <img src = "sdf >">bla bla <img title="displays >" src="big.gif">bla bla<style type="text/css">.text2 { font-size : 11px; font-family : verdana, arial; font-weight : normal></style></root>
    <root>ff <!--img--> it need <!--img-->bla bla <!--img-->bla bla<style type="text/css">.text2 { font-size : 11px; font-family : verdana, arial; font-weight : normal></style></root>
    The second one is in standard XML format. I don't need any information about the image or, e.g., <br> tags.
    But if I convert
    <style type ...> to <!--img-->, an error appears at the </style>.
    For this example, the result of your suggestion will be:
    <root>ff <!--img--> it need <!--img-->bla bla <!--img-->bla bla<!--img-->.text2 { font-size : 11px; font-family : verdana, arial; font-weight : normal></style></root>
    but that is not valid XML.
    Can you suggest something?

  108. Avatar for Manu
    Manu February 1st, 2012

    <body>Do it

    this is information retrieval

    <meta name="abcd"
    content ="ebeg"

    Do it
    this is information retrieval

  109. Avatar for polas
    polas March 6th, 2012

    Hi, I want to find out how to get an embed like this from any website or Google search with regex code
    <div style="" align=""><object type="application/x-shockwave-flash" data="http://" width="" height=""><param name="" value="">
    <br />

  110. Avatar for Gayathri Padmakumar
    Gayathri Padmakumar February 25th, 2013

    I am a student and a beginner in Python. Your explanation is good


  111. Avatar for Abdul Basit
    Abdul Basit March 6th, 2013

    This regex always returns null for me in JavaScript. Can anybody help me? Here is the code:

    $("#<%= text2.ClientID %>").focusout(function () { alert($(this).val().match("/\s]+))?)+\s*|\s*)/?>/")); });
    <asp:textbox id="text2" runat="server">

  112. Avatar for Andi
    Andi March 21st, 2013

    Actually this was very useful to me ... I just needed a quick way of stripping out any HTML and only left with text in a text editor (such as VIM or Notepad++) and for that it works just fine ...

  113. Avatar for Fabiana
    Fabiana September 16th, 2013

    Hi, first of all, sorry for my english.

    I would like to use a regular expression for multiline comments in C#. I have @"/[*][\w\d\s]+[*]/" but that expression only matches the text between /* */ when the comment is on a single line, not when it spans multiple lines.


    /* xxxxxxxx */



    I don't know if I could explain well, but any questions or if you can refer to somewhere that provides this information I would appreciate it.

    Thank you very much.
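    A commonly used pattern for block comments is a lazy match between the delimiters with the "dot matches newline" option switched on. This is a minimal Python sketch using `re.DOTALL`; in .NET the counterpart option is `RegexOptions.Singleline`. The sample source text is made up.

```python
import re

# "/*", then as little as possible (newlines included), then "*/".
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

source = "int a; /* one line */\nint b; /* spans\n   two lines */\nint c;"
comments = BLOCK_COMMENT.findall(source)
print(len(comments))  # 2
print(comments[1])    # the comment that spans two lines
```

    Unlike a character class such as `[\w\d\s]+`, the lazy `.*?` also accepts punctuation inside the comment.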

  114. Avatar for Dave
    Dave March 22nd, 2014

    What about the text (non-HTML) phrase:

    If BB & this is true

  115. Avatar for Havitoosh
    Havitoosh May 28th, 2014

    Unfortunately it will not match tags with a line break inside an attribute value, such as

    <div title="y

  116. Avatar for Rahul Vinod Sharma
    Rahul Vinod Sharma October 27th, 2014

    How can I select this whole text?


    <iframe id="preview" style="height: 202px;" width="320" height="150" scrolling="no"></iframe>
    <h4>Edit the code below & check the live preview above.</h4>

    <textarea id="code" name="code"><!DOCTYPE HTML>
    The <abbr title="HyperText Markup Language Help">HTML Help</abbr>Providing HTML5 help.
    </textarea><script>// </script>

  117. Avatar for scottSEA
    scottSEA February 18th, 2015

    I know this is older than dirt, but shame on you for concatenating all those strings in your source code. ;-)

  118. Avatar for sally
    sally March 6th, 2015

    tried this using php and preg_replace, comes up with error:

    Warning: Unknown modifier '\' in /var/www/html.php on line 20

    line 19: $reg = '\s]+))?)+\s*|\s*)/?>';

    line 20: preg_match_all($reg, $html, $matches);

    Having problems installing on

  119. Avatar for pam
    pam March 8th, 2015

    dude that happens to me all.....the time.

  120. Avatar for Obsbi
    Obsbi January 15th, 2016

    Can't wait for AI to automatically generate this code. We would only need to explain what we want, and it would do it.

  121. Avatar for Nick Jackson
    Nick Jackson May 22nd, 2016

    don't forget that attribute and tag names can contain the "-" and "\w" does not contain this

  122. Avatar for Jay
    Jay January 20th, 2017

    What if you come across a tag with a bunch of white space in it like this "< div />". Is there a way to modify the above regex to accommodate for this?

  123. Avatar for breccs
    breccs July 21st, 2017

    Hey, that looks good, but I still have no idea how I can use regex combined with PowerShell to delete all instances of h1 tags, including the text between the tags. For example, if my script finds <h1 id="This_must_go">This must go</h1>, it should delete it. I want to delete all the h1 strings from my site because I will move it to one that generates replacements for all h1 tags based on the file name. I know I am 13 years behind, but if you can help, that would be great.
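    The idea can be sketched as one lazy match from the opening h1 tag to its closing tag. The example below is in Python rather than PowerShell and is only a rough sketch: the names are made up, and the `[^>]*` part is the naive attribute match the article warns about, so it assumes no attribute value contains ">".

```python
import re

# Opening <h1 ...>, the shortest possible content (newlines included),
# then the closing </h1>.
H1 = re.compile(r"<h1\b[^>]*>.*?</h1\s*>", re.IGNORECASE | re.DOTALL)


def remove_h1(html):
    """Delete every h1 element together with the text inside it."""
    return H1.sub("", html)


page = '<p>intro</p>\n<h1 id="This_must_go">This must go</h1>\n<p>body</p>'
print(remove_h1(page))
```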

  124. Avatar for Gerald Burkholder
    Gerald Burkholder November 14th, 2017

    The current expression in this post won't detect the following tag:
    <meta http-equiv="X-UA-Compatible" content="ie=edge">

    I added a check for dashes in attribute names (\w|-)+

    This gives the resulting expression:

  125. Avatar for Look
    Look December 12th, 2017

    its fantastico...!!

  126. Avatar for Zakariae Filali
    Zakariae Filali December 20th, 2017

    Just want to say big thanks that fixed a big issue where I work ;)