comments edit

Pat Gannon (no blog) makes a great point in the comments on my post about using regular expressions to parse HTML. He says:

Just to play devil’s advocate for a minute, it seems like HTML is just too darned close to XML to have to parse this way. Isn’t there a library out there for converting HTML into XHTML? If you can do that, you can just read the file in using XmlDocument::LoadXml(). Once you’ve done that, you can find your tags using an XPath query. Sorry, I just couldn’t let a parsing post go by without tossing in my two cents ;)

In fact, there are two approaches to this. The first recognizes that HTML is really just an application of SGML. Thus if you have an SGML parser, you’re done. So one option is to try Chris Lovett’s SgmlReader.

In fact, this is what the current version of RSS Bandit uses for auto-discovery of RSS feeds within HTML content. However, I recently replaced it with regular expressions because of some memory use and performance problems we were having with it. In our case, finding these tags with a regular expression is a lot faster and uses less memory. (Now you see the motivation for the post.)

Another option is to use Simon Mourier’s HTML Agility Pack. He takes an interesting approach in that he provides an HtmlDocument class that implements System.Xml.XPath.IXPathNavigable. Thus his approach provides the same interface as an XmlDocument for querying nodes, but doesn’t change the underlying HTML content as many other approaches would by converting it to XML.
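To make Pat’s suggestion a bit more concrete, here is a minimal sketch of the XmlDocument plus XPath route. It assumes the markup has already been converted to well-formed XML with no default namespace and no DOCTYPE (real XHTML usually has both, which would call for an XmlNamespaceManager and some care with DTD resolution). The FeedDiscovery class and FindFeedLinks method are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FeedDiscovery
{
    // Returns the href of every <link> element that advertises an RSS feed.
    // Assumption: the input is already well-formed XML without a default namespace.
    public static List<string> FindFeedLinks(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        List<string> hrefs = new List<string>();
        // Select the href attribute of each matching <link> element.
        XmlNodeList links = doc.SelectNodes("//link[@type='application/rss+xml']/@href");
        foreach (XmlNode href in links)
        {
            hrefs.Add(href.Value);
        }
        return hrefs;
    }
}
```

The catch, of course, is the first step: LoadXml throws on anything that isn’t well-formed, which is most HTML in the wild.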

And just to toot Pat’s horn a bit, I used to be his manager at Solien when he was just starting out in his career. Now he works at Univision and has inherited reams of code that parse through Fortran code as well as proprietary database files. He’s also written his own grammar engine and XML syntax for describing computer languages such as C#. So he knows a thing or two about parsing text. He’s become quite a top-notch developer. I’m just waiting for him to get off his arse and start a blog.

code, regex comments edit

I just love regular expressions. I mean, look at the sample below.


What’s not to like?

OK, I admit, I was a bit intimidated by regular expressions when I first started off as a developer. All I needed was a Substring method and an IndexOf method and I was set. But after a few projects that required some intense text processing, I realized the power and utility of regular expressions. They should be on the tool belt of every developer. To that end, I recommend Mastering Regular Expressions by Jeffrey Friedl. This is really THE book on regular expressions. Reading it will make your Regex-Fu powerful.

So let’s look at a common task of matching HTML tags within the body of some text. When you initially think to parse an HTML tag, it seems quite easy. You might consider the following expression:


Roughly translated, this expression looks for the start of a tag and the tag name, followed by some whitespace and then anything that doesn’t end the tag.

Now this will probably work 99 times out of 100, but there’s a flaw in this expression. Do you see it? What if I asked you to match the following tag?

<img title="displays >" src="big.gif">

Hopefully you see the issue here. The expression will match

<img title="displays >

Unfortunately, this implementation is too naive. We have to consider the fact that the greater-than symbol does not end a tag if it’s within a quoted attribute value. Thus we must correctly match attributes.
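Here’s a small sketch that reproduces the problem. The pattern below is an assumed stand-in written to fit the description above (opening bracket, tag name, whitespace, then anything that isn’t a greater-than sign); the class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveTagDemo
{
    // Assumed stand-in for the naive expression described above:
    // "<" or "</", a tag name, whitespace, then everything up to the first ">".
    public const string NaivePattern = @"</?\w+\s+[^>]*>";

    // Returns the text of the first match, or an empty string if nothing matches.
    public static string FirstMatch(string html)
    {
        return Regex.Match(html, NaivePattern).Value;
    }
}
```

Calling NaiveTagDemo.FirstMatch on the img tag above stops at the greater-than sign inside the title attribute, because [^>]* has no idea it is inside a quoted value.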

Now there are four possible formats for an HTML attribute:

name="double quoted value"

name='single quoted value'

name=notquotedvaluewithnowhitespace

name

Each of these cases is quite simple. In the first case, you could do the following:


The portion "[^"]*" matches a double quote, followed by any number of non-double-quote characters, followed by a double quote. Another way to express this is to use lazy evaluation as such:


The portion ".*?" uses lazy evaluation (the “lazy star”) to match as few characters as possible. For example, if we had a string like so:

<a name="test" value="test2">

evaluating ".*" (aka greedy) would match

"test" value="test2"

However, using lazy evaluation consumes the fewest characters that match the expression, so the first match using ".*?" would be "test" and the second would be "test2".
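If you want to see the difference for yourself, here’s a quick sketch. QuantifierDemo and AllMatches are just names invented for this example, and the input is assumed to have both attribute values in double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    public const string Greedy = "\".*\"";   // the regex ".*"
    public const string Lazy = "\".*?\"";    // the regex ".*?"

    // Collects the text of every match of the pattern in the input.
    public static List<string> AllMatches(string input, string pattern)
    {
        List<string> results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Against a tag with two double-quoted values, the greedy pattern returns a single match running from the first quote to the last, while the lazy pattern returns each quoted value separately (quotes included).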

The full expression for matching an HTML tag is that lovely mash of characters presented at the very beginning of this post. It’s a modified version of the one presented in Friedl’s book.
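To show how the four attribute formats fit together, here is one way such an expression could be assembled. Treat it as a sketch and not necessarily the exact expression this post refers to: the class, constant, and method names are invented for the example, and it deliberately ignores comments, DOCTYPE declarations, and attribute names containing hyphens or colons.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlTagSketch
{
    // The three value formats; the fourth format (a bare name) has no value at all.
    const string DoubleQuoted = "\"[^\"]*\"";
    const string SingleQuoted = "'[^']*'";
    const string Unquoted = "[^'\">\\s]+";

    // Whitespace, an attribute name, then an optional "= value" part.
    const string Attribute =
        @"\s+\w+(?:\s*=\s*(?:" + DoubleQuoted + "|" + SingleQuoted + "|" + Unquoted + "))?";

    // "<" or "</", a tag name, any number of attributes, optional "/" and ">".
    public const string TagPattern = @"</?\w+(?:" + Attribute + @")*\s*/?>";

    // Returns every tag found in the input, in document order.
    public static List<string> FindTags(string html)
    {
        List<string> tags = new List<string>();
        foreach (Match m in Regex.Matches(html, TagPattern))
        {
            tags.Add(m.Value);
        }
        return tags;
    }
}
```

Because quoted values are consumed as a unit, a greater-than sign inside an attribute value no longer ends the tag early.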

However, I wouldn’t recommend you just plunk that down in your code. Rather, you should consider adding it to a regular expression library assembly.

Don’t know how? Well, I’ll show you a code listing for an exe that, when run, builds a fully compiled version of this regular expression into an assembly that you can then reference in any project. In a later installment, I’ll explain in more detail just what the code is doing and how to use the compiled assembly. How irresponsible of me not to do that now. ;)
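For reference, the general shape of such a builder on the .NET Framework might look like the sketch below, which uses Regex.CompileToAssembly. This is an illustration under assumptions, not the listing this post refers to: the assembly, namespace, and type names are hypothetical, and on .NET Core and later runtimes CompileToAssembly is not supported and throws PlatformNotSupportedException, which the sketch catches.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

public static class RegexLibraryBuilder
{
    // Hypothetical names chosen for illustration only.
    public const string LibraryName = "Sample.RegexLib";
    public const string RegexNamespace = "Sample.RegexLib";
    public const string RegexTypeName = "HtmlTagRegex";

    // Compiles the pattern into LibraryName.dll in the current directory.
    // Returns true if the assembly was written, false if this runtime
    // does not support Regex.CompileToAssembly.
    public static bool TryBuild(string pattern)
    {
        RegexCompilationInfo info = new RegexCompilationInfo(
            pattern, RegexOptions.IgnoreCase, RegexTypeName, RegexNamespace, true);
        try
        {
            Regex.CompileToAssembly(
                new RegexCompilationInfo[] { info },
                new AssemblyName(LibraryName + ", Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"));
            return true;
        }
        catch (PlatformNotSupportedException)
        {
            return false;
        }
    }
}
```

Where it succeeds, a project that references the resulting assembly can instantiate the generated type (Sample.RegexLib.HtmlTagRegex under the names assumed here) like any other Regex.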

Source Listing

comments edit

Weird. I did a Google search for an entry in my blog and one of the results was a Bloglines account that had my blog subscribed. I was basically seeing all the blogs that some Bloglines user was subscribed to. Is that a feature of Bloglines to expose your subscriptions like that? Or is that a privacy flaw?

UPDATE: Never mind. I’m just being paranoid. Bloglines supports public profiles.

comments edit

Dare Obasanjo, the project lead on the RSS Bandit project (to which I contribute), is leaving his post as a Program Manager on the XML team at Microsoft to work as a Program Manager on the MSN Communication Services Platform team.

When Microsoft revealed a blogging service similar to Blogger, I had a feeling it was only a matter of time before Dare would somehow be involved with it, given his interest in social software.

It will be interesting to see the direction Microsoft takes with social software. Although Microsoft perhaps doesn’t see entering the aggregator market as a profit center, I wouldn’t be surprised if that changes in the next year or so.

As aggregation continues to take off, it seems natural to incorporate it into Office. Remember that “Information At Your Fingertips” mantra Mr. Gates touted a while ago? Well I get most of my online information through two sources, Google and RSS Bandit.

In any case, I wish Dare well. Hopefully this is the platform for him to have some of his ideas implemented. I have to admit, I’d love to work on social software such as RSS Bandit and .TEXT full time. But I have a mortgage to pay.

comments edit

Since I like to stoke the fire of partisanship… This joke was sent to me by my friend Walter.

George Bush meets with the Queen of England. He asks her, “Your Majesty, how do you run such an efficient government? Are there any tips you can give to me?”

“Well,” says the Queen, “the most important thing is to surround yourself with intelligent people.”

Bush frowns. “But how do I know the people around me are really intelligent?”

The Queen takes a sip of tea. “Oh, that’s easy. You just ask them to answer an intelligent riddle.” The Queen pushes a button on her intercom. “Please send Tony Blair in here, would you?”

Tony Blair walks into the room. “Yes, my Queen?”

The Queen smiles. “Answer me this, please, Tony. Your mother and father have a child. It is not your brother and it is not your sister. Who is it?”

Without pausing for a moment, Tony Blair answers, “That would be me.”

“Yes! Very good,” says the Queen.

Bush goes back home to ask Dick Cheney, his vice president, the same question.

“Dick, answer this for me. Your mother and your father have a child. It’s not your brother and it’s not your sister. Who is it?”

“I’m not sure,” says Cheney, “let me get back to you on that one.”

Cheney goes to his advisors and asks every one, but none can give him an answer. Finally, he ends up in the men’s room and recognizes Colin Powell’s shoes in the next stall. Cheney shouts, “Colin! Can you answer this for me? Your mother and father have a child and it’s not your brother or your sister. Who is it?”

Colin Powell yells back, “That’s easy. It’s me!”

Cheney smiles, and says, “Thanks!” Then, Cheney goes back to speak with Bush. “Say, I did some research and I have the answer to that riddle. It’s Colin Powell.”

Bush gets up, stomps over to Cheney and angrily yells into his face, “No, you idiot! It’s Tony Blair!”

comments edit

Scott Guthrie has returned to blogging with a tremendous piece on his team’s effort towards reaching “ZBB” or Zero Bug Bounce.

I’ve personally never worked on a software project as large as the ASP.NET 2.0 project, so it’s fascinating for me to read Scott’s description of the testing and check-in process. Typically, my check-in process is to get latest on any files I didn’t change, build, and run my unit tests. Assuming everything passes, I check in my files, get latest again, build, and run the unit tests again. If everything still passes, I’m done with the check-in. If all goes smoothly, it’s all done in under half an hour.

For the ASP.NET team, every check-in undergoes peer review and is run through a few hours of check-in test suites. They then run more exhaustive nightly tests over the product to catch issues in the latest builds. That’s pretty impressive.

code, tdd comments edit

Jonathan de Halleux, aka Peli, never ceases to impress me with his innovations within MbUnit. In case you’re not familiar with MbUnit, it’s a unit testing framework similar to NUnit.

The difference is that while NUnit seems to have stagnated, Jonathan is constantly innovating new features, test fixtures, etc… for a complete unit testing solution. In fact, he’s even made it so that you can run your NUnit tests within MbUnit without a recompile.

His latest feature is not necessarily a mind blower, but it will definitely save me a lot of time writing the same type of code over and over for testing a range of values. I’ll just show you a code snippet and you can figure out what it’s doing for you.


[TestFixture]
public class DivisionFixture
{
    [RowTest]
    [Row(1000,10,100.0000)]
    [Row(-1000,10,-100.0000)]
    [Row(1000,7,142.85715)]
    [Row(1000,0.00001,100000000)]
    [Row(4195835,3145729,1.3338196)]
    public void DivTest(double num, double den, double res)
    {
        Assert.AreEqual(res, num / den, 0.00001 );
    }
}
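In case it isn’t obvious what the row test buys you, here’s roughly the loop you would otherwise write by hand. This is a plain C# sketch with no test framework involved; DivisionRows and CountPassing are names invented for the example, and the rows are copied from the [Row] attributes above.

```csharp
using System;

public static class DivisionRows
{
    // Each row is { numerator, denominator, expected result }.
    public static readonly double[][] Rows =
    {
        new double[] { 1000, 10, 100.0000 },
        new double[] { -1000, 10, -100.0000 },
        new double[] { 1000, 7, 142.85715 },
        new double[] { 1000, 0.00001, 100000000 },
        new double[] { 4195835, 3145729, 1.3338196 }
    };

    // Counts the rows whose quotient is within delta of the expected value.
    public static int CountPassing(double delta)
    {
        int passing = 0;
        foreach (double[] row in Rows)
        {
            if (Math.Abs(row[0] / row[1] - row[2]) <= delta)
            {
                passing++;
            }
        }
        return passing;
    }
}
```

One practical difference: a hand-rolled loop usually stops at the first failing assertion, whereas a row test treats each row as its own test case, so you can see exactly which inputs failed.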


And if you’re anal like me and wondering why I chose “num” instead of “numerator” etc… Purely for blog formatting reasons. ;)

UPDATE: Jonathan points out that negative assertions are also supported. Here’s an illustrative code snippet. I can’t wait to try this out.


[RowTest]
[Row(1000,10,100.0000)]
...
[Row(1,0,0, ExpectedException = typeof(ArithmeticException))]
public void DivTest(double num, double den, double res)
{...}
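One thing to watch if you try that exact row: in C#, dividing a double by zero doesn’t throw. It quietly produces Infinity (or NaN for 0/0), so an expected ArithmeticException would only show up with int or decimal parameters, where a DivideByZeroException (a subclass of ArithmeticException) is raised. Here’s a small sketch of the difference, with made-up class and method names:

```csharp
using System;

public static class DivideByZeroDemo
{
    public static double DivideDoubles(double num, double den)
    {
        return num / den;
    }

    public static int DivideInts(int num, int den)
    {
        return num / den;
    }

    // Runs the given division and reports whether it raised an ArithmeticException.
    public static bool ThrowsArithmetic(Action division)
    {
        try
        {
            division();
            return false;
        }
        catch (ArithmeticException)
        {
            return true;
        }
    }
}
```

So the snippet is best read as an illustration of the ExpectedException syntax, not as a test that would pass as written with double arguments.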

comments edit

Saw this on Gizmodo. It’s bigger and not as nice looking as an iPod, but it is 100 GB.

The DMC Xclef 500 also supports Ogg Vorbis and even WAV—with a 100GB drive, you could start ripping your CDs with no compression at all. The 100GB version is $450 from DMC’s online store.

humor comments edit

My friend Michael who lives in London for now sent me this.

Once again, The Washington Post published its yearly contest in which readers are asked to supply alternate meanings for various words (& leave it to the Post to search for new meanings).

And the winners are …

1. Coffee (n.), a person who is coughed upon.

2. Flabbergasted (adj.), appalled over how much weight you have gained.

3. Abdicate (v.), to give up all hope of ever having a flat stomach.

4. Esplanade (v.), to attempt an explanation while drunk.

5. Willy-nilly (adj.), impotent.

6. Negligent (adj.), describes a condition in which you absentmindedly answer the door in your nightgown.

7. Lymph (v.), to walk with a lisp.

8. Gargoyle (n.), an olive-flavored mouthwash.

9. Flatulence (n.), the emergency vehicle that picks you up after you are run over by a steamroller.

10. Balderdash (n.), a rapidly receding hairline.

11. Testicle (n.), a humorous question on an exam.

12. Rectitude (n.), the formal, dignified demeanor assumed by a proctologist immediately before he examines you.

13. Oyster (n.), a person who sprinkles his conversation with Yiddish expressions.

14. Pokemon (n.), a Jamaican proctologist.

15. Frisbeetarianism (n.), the belief that, when you die, your soul goes up on the roof and gets stuck there.

16. Circumvent (n.), the opening in the front of boxer shorts.

comments edit

I know a lot of people like to post pictures of their workspace online. Not sure why (vanity!), but they just do. So I thought I’d jump on that bandwagon and do the same.

This first picture shows my work office with its nice 17th floor view.

Work Office: Strange green bands take over the screens.

This next one is our home office.

If you look carefully, you’ll notice the hastily minimized porn application.

As you can see, the home setup is much nicer than the work setup, with dual 17” flat-panel monitors and a slick-looking aluminum Shuttle case. I wish my company would invest in nice monitors. My work monitors flicker, make me cross-eyed and spit in my food. If you look closely at the top picture, the computer case is literally held together with Scotch tape on top. The IT department wouldn’t budget duct tape.

The little figurine on top is the “Buddy Christ” from Dogma. You can purchase that on Kevin Smith’s website. My wife painted the red shoe on the left.

comments edit

Allow users to configure Google Desktop to search their GMail accounts. Most of my personal email isn’t going to be in Outlook. It’ll be in my web-based accounts.

comments edit

After reading the reaction around the net about Google Desktop (GD for short), one common complaint I noticed is the use of a web browser for local searching. Why use a web browser to search locally and forgo all the utility and benefits a rich client can provide?

So I thought I’d start trying out some of the free alternatives to GD. One that is mentioned quite often is Copernic. I decided to uninstall GD and give it a whirl.

So far, I’m not quite satisfied. I left it running overnight and it’s still not done indexing my hard drive. Not only that, while it’s indexing, my computer runs at a snail’s pace at times. I often have to restart it to reclaim my computer’s resources. In comparison, GD finished indexing within a much shorter time span with a nearly imperceptible impact on my computer’s performance.

One area where Copernic shines above GD is the UI. Copernic provides options for refining the search parameters just below the search input. When you search for emails, the search result window breaks down the results by date. You can see them grouped into emails received Today, Yesterday, Last Week, Last Month, This Year, and so on…

One shortcoming that both engines share is the inability to specify file types to search other than the preconfigured ones. For example, I would like to search my C# files that have the .cs extension. No can do.

So the search continues. I could shell out for X1, but I’d like to find a free product I can use at both work and home. I read about another product to try at home, but forgot its name. In any case, I’ll keep you posted.

UPDATE: Whoops! Apparently you can configure Copernic to index arbitrary file types through the advanced options dialog. Thanks to Eric for the tip.

comments edit

A strange phenomenon occurred last night and into this morning here in Los Angeles. I kid you not, but water… fell from the SKY!

Near panic ensued throughout the city as Angelenos tried to make sense of this unexpected situation. My league soccer game was cancelled due to poor field conditions (aka mud).

Fortunately, I had a backup plan. Every Saturday and Sunday I have a pick-up game with a wonderful group of people. The field we play on was in need of water and by 2 PM was in perfect condition. Water even fell from the sky again in the midst of our game. It was a beautiful experience.

Afterwards, my brother-in-law took us out for dinner at The Lobster. Oh man, is that food ever good! Try the scallops.

comments edit

Saw this post by Craig Andera about Test Driven Development and I have to say I completely agree with him.

I’ve been doing TDD for several years now and I tend to restrict it to testing business and data access layers. Currently, it’s not practical to perform comprehensive TDD for UI layers (though tools like NUnitAsp and NUnitForms help).

Even these tools don’t address one of the biggest challenges when it comes to testing a UI layer. The UI layer is the layer most likely to change and change often. After spending a few hours building your tests, some guy in marketing will call you up and say, “can we replace that button with a table of data with clickable rows?” What now?

Unit tests tend to be quite fragile when faced with a changing UI layer. Human testers have no problem dealing with such change, but your unit tests definitely will.

My recommendation for testing the UI layer is to combine test scripts (for human testers) with unit tests written whenever a bug is found in the UI layer. At that point the UI should hopefully be frozen enough that writing a unit test that exposes the bug and then fixing the bug will be a worthwhile investment for regression purposes.

comments edit

Read this article at FastCompany, pointed to by Steve Maine, and maybe I’m lazy, but I totally disagree.

The article makes the point that the concept of “work-life balance” is a pipe dream. What the article fails to mention are the associated health problems for many workaholics. For programmers in particular, ailments such as RSI are common (though many programmers such as myself count programming as a hobby as well which would also contribute).

The article briefly dismisses the European notion of “Working To Live”. I think in doing so, it fails to address the societal and cultural issues that often drive a work-life imbalance. How successful is this notion of succeeding at all costs as a source of fulfillment? The article mentions that imbalance is required to gain real productivity, but is that the measure of one’s success?

It’s well documented that Americans tend to spend their hard-earned money on things and possessions while Europeans spend more on vacations and events. Should work be the primary defining character trait of a person? In the U.S., the first question in any social setting is “What do you do?”. In many European countries, it’s a social gaffe to ask that of a stranger. Why not ask, “What do you like to do?”.

It’s my contention that this single-minded focus on materialism (and I’m not totally against materialism as I LOVE my iPod) is the driving force behind working too much. If one were to step back and look at what really gives one fulfillment, I think priorities will often be rearranged. Not that I’m against working hard. I love to write code and read books about coding and software management in my spare time. However, I also realize the value of defining myself along other interests as well. I realize the value of maintaining my health via exercise and of my mental health through maintaining meaningful relationships with friends and family.

It reminds me of something I’ve heard somewhere or other. How often do you hear people in the twilight of their lives or on their death beds reflect on the wonderful time they spent at the office?

comments edit

Saw this going around the web. Classic!

Dear President Bush, Thank you for doing so much to educate people regarding God’s law. I have learned a great deal from you and try to share that knowledge with as many people as I can. When someone tries to defend the homosexual lifestyle, for example, I simply remind them that Leviticus 18:22 clearly states it to be an abomination. End of debate. I do need some advice from you, however, regarding some other elements of God’s Laws and how to follow them:

​1. Leviticus 25:44 states that I may possess slaves, both male and female, provided they are purchased from neighboring nations. A friend of mine claims that this applies to Mexicans, but not to Canadians. Can you clarify? Why can’t I own Canadians?

​2. I would like to sell my daughter into slavery, as sanctioned in Exodus 21:7. In this day and age, what do you think would be a fair price for her?

​3. I know that I am allowed no contact with a woman while she is in her period of menstrual uncleanliness (Lev. 15:19-24). The problem is, how do I tell? I have tried asking, but most women take offense.

​4. When I burn a bull on the altar as a sacrifice, I know it creates a pleasing odor for the Lord (Lev. 1:9). The problem is my neighbors. They claim the odor is not pleasing to them. Should I smite them?

​5. I have a neighbor who insists on working on the Sabbath. Exodus 35:2 clearly states that he should be put to death. Am I morally obligated to kill him myself, or should I ask the police to do it?

​6. A friend of mine feels that, even though eating shellfish is an abomination (Lev. 11:10), it is a lesser abomination than homosexuality. I don’t agree. Can you settle this? Are there “degrees” of abomination?

​7. Lev. 21:20 states that I may not approach the altar of God if I have a defect in my sight. I have to admit that I wear reading glasses. Does my vision have to be 20/20, or is there some wiggle-room here?

​8. Most of my male friends get their hair trimmed, including the hair around their temples, even though this is expressly forbidden by Lev. 19:27. How should they die?

​9. I know from Lev. 11:6-8 that touching the skin of a dead pig makes me unclean, but may I still play football if I wear gloves?

​10. My uncle has a farm. He violates Lev. 19:19 by planting two different crops in the same field, as does his wife by wearing garments made of two different kinds of thread (cotton/polyester blend). He also tends to curse and blaspheme a lot. Is it really necessary that we go to all the trouble of getting the whole town together to stone them (Lev. 24:10-16)? Couldn’t we just burn them to death at a private family affair, like we do with people who sleep with their in-laws (Lev. 20:14)?

I know you have studied these things extensively and thus enjoy considerable expertise in such matters, so I am confident you can help.

comments edit

Time to get political. I loved how Kerry caught Bush off-guard when he pointed out that in 2002, Bush said he wasn’t concerned with Osama. Realize this is at the same time our troops were in Afghanistan looking for the bastard. I think that beats the out of context “nuisance” quote by a mile, because even in context, Bush’s quote is damning.

comments edit

Colin sent me an email pointing me to an add-in he wrote for VS.NET that allows you to copy selected source code to the clipboard as syntax highlighted HTML.

By selecting some code and right clicking the code editor, you’ll see an option to Copy Source as HTML.

Selecting that menu item brings up a dialogue where you can configure some options. It’s based on my favorite code to HTML formatter by Manoli.

Below is an example of a code snippet using this tool:

/// <summary>
/// Sets the stack trace for the given lock target 
/// if an error occurred.
/// </summary>
/// <param name="lockTarget">Lock target.</param>
public static void ReportStackTrace(object lockTarget)
        ManualResetEvent waitHandle = 
                 as ManualResetEvent;
        if(waitHandle != null)
        _failedLockTargets[lockTarget] = new StackTrace();
        //TODO: Now's a good time to use your
        //favorite logging framework.

Colin, thanks for pointing me to this. This is freakin’ awesome!

Now, if I could have a shortcut that would use the default options and immediately put the selected source in the clipboard, that would just rock my world. Also, one minor niggle I’ve had with the Manoli formatter is that the XML comment tags and the triple slashes (such as /// ) should be gray to mimic VS.NET instead of all green. How hard would it be to fix that?