Named Formats Redux

Jan 14, 2009 code suggest edit

UPDATE: Be sure to read Peli’s post in which he explores all of these implementations using PEX. Apparently I have a lot more unit tests to write in order to define the expected behavior of the code.

I recently wrote a post in which I examined some existing implementations of named format methods and then described the fun I had writing a routine for implementing named formats.

In response to that post, I received two new implementations of the method that are worth calling out.

The first was sent to me by Atif Aziz, who I’ve corresponded with via email regarding Jayrock. He noticed that the slowest format method by James had one obvious performance flaw and fixed it up. He then sent me another version that passed all my unit tests. Here it is in full, and it’s much faster than before.

public static class JamesFormatter
{
  public static string JamesFormat(this string format, object source)
  {
    return FormatWith(format, null, source);
  }

  public static string FormatWith(this string format
      , IFormatProvider provider, object source)
  {
    if (format == null)
      throw new ArgumentNullException("format");

    List<object> values = new List<object>();
    string rewrittenFormat = Regex.Replace(format,
      @"(?<start>\{)+(?<property>[\w\.\[\]]+)(?<format>:[^}]+)?(?<end>\})+",
      delegate(Match m)
      {
        Group startGroup = m.Groups["start"];
        Group propertyGroup = m.Groups["property"];
        Group formatGroup = m.Groups["format"];
        Group endGroup = m.Groups["end"];

        values.Add((propertyGroup.Value == "0")
          ? source
          : Eval(source, propertyGroup.Value));

        int openings = startGroup.Captures.Count;
        int closings = endGroup.Captures.Count;

        return openings > closings || openings % 2 == 0
           ? m.Value
           : new string('{', openings) + (values.Count - 1) 
             + formatGroup.Value
             + new string('}', closings);
      },
      RegexOptions.Compiled 
      | RegexOptions.CultureInvariant 
      | RegexOptions.IgnoreCase);

    return string.Format(provider, rewrittenFormat, values.ToArray());
  }

  private static object Eval(object source, string expression)
  {
    try
    {
      return DataBinder.Eval(source, expression);
    }
    catch (HttpException e)
    {
      throw new FormatException(null, e);
    }
  }
}

The other version I got was from another Microsoftie named Henri Wiechers (no blog) who pointed out that a state machine of the sort that a parser or scanner would use, is well suited for this task. It also makes the code easier to understand if you’re accustomed to state machines.

public static class HenriFormatter
{
  private static string OutExpression(object source, string expression)
  {
    string format = "";

    int colonIndex = expression.IndexOf(':');
    if (colonIndex > 0)
    {
      format = expression.Substring(colonIndex + 1);
      expression = expression.Substring(0, colonIndex);
    }

    try
    {
      if (String.IsNullOrEmpty(format))
      {
        return (DataBinder.Eval(source, expression) ?? "").ToString();
      }
      return DataBinder.Eval(source, expression, "{0:" + format + "}") 
          ?? "";
    }
    catch (HttpException)
    {
      throw new FormatException();
    }
  }

  public static string HenriFormat(this string format, object source)
  {
    if (format == null)
    {
      throw new ArgumentNullException("format");
    }

    StringBuilder result = new StringBuilder(format.Length * 2);      

    using (var reader = new StringReader(format))
    {
      StringBuilder expression = new StringBuilder();
      int @char = -1;

      State state = State.OutsideExpression;
      do
      {
        switch (state)
        {
          case State.OutsideExpression:
            @char = reader.Read();
            switch (@char)
            {
              case -1:
                state = State.End;
                break;
              case '{':
                state = State.OnOpenBracket;
                break;
              case '}':
                state = State.OnCloseBracket;
                break;
              default:
                result.Append((char)@char);
                break;
            }
            break;
          case State.OnOpenBracket:
            @char = reader.Read();
            switch (@char)
            {
              case -1:
                throw new FormatException();
              case '{':
                result.Append('{');
                state = State.OutsideExpression;
                break;
              default:
                expression.Append((char)@char);
                state = State.InsideExpression;
                break;
            }
            break;
          case State.InsideExpression:
            @char = reader.Read();
            switch (@char)
            {
              case -1:
                throw new FormatException();
              case '}':
                result.Append(OutExpression(source, expression.ToString()));
                expression.Length = 0;
                state = State.OutsideExpression;
                break;
              default:
                expression.Append((char)@char);
                break;
            }
            break;
          case State.OnCloseBracket:
            @char = reader.Read();
            switch (@char)
            {
              case '}':
                result.Append('}');
                state = State.OutsideExpression;
                break;
              default:
                throw new FormatException();
            }
            break;
          default:
            throw new InvalidOperationException("Invalid state.");
        }
      } while (state != State.End);
    }

    return result.ToString();
  }

  private enum State
  {
    OutsideExpression,
    OnOpenBracket,
    InsideExpression,
    OnCloseBracket,
    End
  }
}

As before, I wanted to emphasize that this was not a performance contest. I was interested in correctness with reasonable performance.

Now that the JamesFormatter is up to speed, I was able to crank up the number of iterations to 50,000 without testing my patience and tried it again. I also made the format string slightly longer. Here are the new results.

As you can see, they are all very close, though Henri’s method is the fastest.

I went ahead and updated the solution I had uploaded last time so you can try it yourself by downloading it now.

Found a typo or mistake in the post? suggest edit

Comments

23 responses

Joe Chung • January 14th, 2009
I really like state machines, but it's one algorithm that is in desperate need of a cleaner syntax to capture its semantics.
Maybe one of these days, we'll get proper support for continuations in .NET. There's already tail call optimization support in the CLR, but no CLR language seems to support it yet (except maybe F#?).
brad • January 14th, 2009
Doing stuff like this reminds you just how fast computers actually are. 50k formats in <1ms WORST case really makes it hurt that much more when you have to wait for VS to nuke your processor for 30 or 60 seconds doing some seemingly simple task.
Peli • January 14th, 2009
An interresting thing to do when you have multiple implementations that share the same specification is to compare their outcome. This is really a great scenario to apply tools such as Pex.
For example, in this case, I wrote this simple parameterized unit test:
[PexMethod]
public void HenriVsHaacked(string format, object o)
{
// checks that return value or exception caught match
PexAssert.AreBehaviorsEqual(
() => HenriFormatter.HenriFormat(format, o),
() => HaackFormatter.HaackFormat(format, o)
);
}
The 5-th test case that Pex generates is very interresting: '}'. Sounds innocent enough but the two formatteres don't agree on the outcome for this format string and triggers the assert:
"left threw a System.FormatException exception and right did not."
In fact, the Henri formatter throws a format exception while Phil's happily survives it.
Peli • January 15th, 2009
After more investigation, it turns out none of the implementations match... see blog.dotnetwiki.org/.../...matsPexTestimonium.aspx .
Atif Aziz • January 15th, 2009
Implementing a table-based state machine by hand, as Henri has done it, is a bit unnecessary given yield in C# (as you've done it in your version). Code using yield is almost always (given certain complexity) easier to write, read, understand, maintain and extend. It is also more concurrent in nature because its easier to avoid side effects. The heavy-lifting of creating an optimal and table-based state machine out of such yielding code can be left as an exercise for the compiler folks. :) Fortunately, that's what already happens today.
I'm not sure if the number of states and complexity of the problem warrants modeling it in terms of an explicit state machine. A simpler implementation with the usual control suspects of the language will do. I think you'll find the following version shorter, simpler and easier to read, follow and debug:
http://gist.github.com/47888
Henri • January 15th, 2009

I'm not sure if the number of states and complexity of the problem warrants modeling it in terms of an explicit state machine. A simpler implementation with the usual control suspects of the language will do. I think you'll find the following version shorter, simpler and easier to read, follow and debug:

To me, a simple state machine is easier to understand than a mishmash of loops and branching. These things are subjective, so I encourage people to compare them and decide for themselves.
Henri • January 15th, 2009
The new version of JamesFormatter fails on "{{x}}".
Kelly Stuard • January 16th, 2009
Notes:
* Only the HanselFormat and OskarFormat work with IDictionary passed as the object.
* HanselFormat is nice enough to not throw a format exception if{Title} has no corresponding Title property passed in. This is different from string.Format, but nice if i want to pass my string through multiple dictionaries / anonymous classes / classes.
Atif Aziz • January 16th, 2009
@Henri:

The new version of JamesFormatter fails on "{{x}}".

Yep and there are probably other failing cases not covered by the tests. See my comment to Richard in the earlier post.
In the end, I think your implementation is the best in terms of speed and simplicity. And as you said, whether its simplicity lies in a table-based state machine approach or regular control flow constructs of the language can be left up to the eye and mind of the beholder.
Someone out to nudge the BCL team to add this functionality to String.Format in .NET Framework 4.0 timeframe. There are some great ideas in PEP 3101 that specifies advanced string formatting as it may become available with Python 3000. It would be very useful as dynamic languages and dynamic features in existing languages like C# become more common on the CLR.
Atif Aziz • January 16th, 2009
Oops, the link behind "my comment to Richard in the earlier post" in the previous comment is incorrect. The right one is:
my comment to Richard in the earlier post
Jame sHart • January 18th, 2009
I'd been mooching around with some ideas for a Linq-expression based format expression 'compiler' approach for a while; never got around to actually building it. So when you posted up this test suite it seemed like it might help me prototype something up.
Was surprised at the result - see my blog entry on the subject at blogs.ipona.com/james/archive/2009/01/16/8557.aspx.
roboto • January 18th, 2009
As I had an implementation for this before, I adopted it to your StringLib to enter the 'competition' :)
http://geekswithblogs.net/NotImpossible/archive/2009/01/19/named-format-strings.aspx
What was interesting as well, is a comparison with old fashioned string.Format("{0} {1}...").
Atif Aziz • January 20th, 2009
@Kelly Stuard:

Only the HanselFormat and OskarFormat work with IDictionary passed as the object.

Any implementation that uses DataBinder.Eval, as the other formatters do, can work with dictionaries (or any other type for that matter) provided that the supplied source has an ICustomTypeDescriptor implementation available. DataBinder.Eval internally relies on ICustomTypeDescriptor to resolve the property names and subsequently acquire their values. As a result, those formatters are quite open and don't need any further modification. In my projects, I have a keyed-collection implementation that also implements ICustomTypeDescriptor and which can be passed on to such formatters just fine. The code can be found below:
DictionaryObject.cs
Jon • January 20th, 2009
I think it's unfair to compare the implementations using DataBinder.Eval with those just doing basic reflection on a type's properties. Eval supports nested properties, indexers, etc, but will inevitably be somewhat slower.
For what it's worth, here's my attempt at an implementation: http://www.amtiskaw.co.uk/JonFormatter.txt
I tried to do it like I would have in 1st year CompSci. I convert the string to a char array and step through with a for loop, building up a StringFormatter result as I go. Turned out a couple of hundredths of a second faster than Henri's.
Jon • January 20th, 2009
Also: Your comments system seems to be lying about what tags it allows!
Craig Stuntz • January 22nd, 2009
Sorry, this has nothing to do with Named Formats, but I couldn't find a post about TempData that was less than a year old.
After answering many questions about TempData in ASP.NET MVC, I have a proposal to rename it. I think that the name "RedirectData" will eliminate most of the confusion that people have about this property.
blogs.teamb.com/craigstuntz/2009/01/23/37947/
I know you're getting close to release, but what do you think?
Andy Scott • January 25th, 2009
Hey Phil,
Great post - just wondering whether there is any news on the MVC Release Candidate ?
All news have been quite on this front and just trying to plan in some time for upgrade and reading over the new docs/changes etc for my site ?
Thanks!
Andy
Samen Trouwen • March 3rd, 2009
Craig, that clarification made it clear to me. It probably will be a good idea! :)
dotnetchris • March 23rd, 2012
I created a fork of Atif Aziz's version of the Henriformatter: https://gist.github.com/2170953
This adds that you can use anonymous object, IDictionary<string,object>, and RouteData as all valid sources for the Heniformat source.
Hmmm I'm thinking maybe this should be uploaded as a nuget package.
Zaus • September 21st, 2012
Yet another attempt: https://gist.github.com/3763138
I was curious, and set up my own tests using some randomly generated string-masks (https://gist.github.com/3763223) and I get different test results. Granted I'm using a homegrown benchmark framework (all it's doing is running the Action X times and using Stopwatch), but I thought the results were kinda surprising -- seems the performance somewhat depends on the size of the test set (tokens + string).
Mladen Mihajlovic • September 14th, 2013
Another one here http://www.codeproject.com/...
FEBit • March 27th, 2015
Exception Details: System.MissingMethodException: Method 'System.String.ToString' not found.

Source Error:

Line 75: else //format info
Line 76: {
Line 77: result = retrievedType.InvokeMember("ToString",
Line 78: BindingFlags.Public | BindingFlags.NonPublic |
Line 79:
Ryan • March 22nd, 2016
Another one joins the fight: https://github.com/crozone/...

With NuGet: https://www.nuget.org/packa...

I wrote this as a state machine that tokenizes the input into a List of {Index, Length} with reference to the original format string. The tokens then get looped over, any tokens that are classified as parameters are handled, and everything is fed directly into a StringBuilder without any intermediate SubString() operations. The tokenizer can also take custom characters for the opening and closing braces (like '<' and '>'), and custom behaviour can be specified for how the parameters are handled.

It also doesn't rely on DataBinder.Eval(), so it's compliant with .NET Standard/Core. Feedback would be appreciated if anyone wants to give it a spin!