Hidden Code Mines

Code is unforgiving. As the reasonable human beings that we are, when we review code we both know what the author intends. But computers can’t wait to Well, Actually all over that code like a lonely Hacker News commenter:

Well Actually, Dave. I’m afraid I can’t do that.

Hal, paraphrased from 2001: A Space Odyssey

As an aside, imagine the post-mortem review of that code!

Code review is a tricky business. Code is full of hidden mines that lie dormant while you test, only to explode in a debris of stack trace at the most inopportune time – when it's in the hands of your users.

The many times I've run into such mines just reinforce how important it is to write intention-revealing code and to make sure assumptions are documented via asserts.

Such devious code is often the most innocuous-looking code. Let me give one example I ran into the other day. I was fortunate to defuse this mine while testing.

This example makes use of the Enumerable.ToDictionary method that turns a sequence into a dictionary. You supply an expression to produce a key for each element. In this example, loosely based on the actual code, I am using the CloneUrl property of Repository as the key of the dictionary.

IEnumerable<Repository> repositories = GetRepositories();
repositories.ToDictionary(r => r.CloneUrl);

It’s so easy to gloss over this line during a code review and not think twice about it. But you probably see where this is going.

While I was testing I was lucky to run into the following exception:

System.ArgumentException: 
An item with the same key has already been added.

Doh! There’s an implicit assumption in this code – that two repositories cannot have the same CloneUrl. In retrospect, it’s obvious that’s not the case.

Let’s simplify this example.

var items = new[]
{
    new {Id = 1}, 
    new {Id = 2}, 
    new {Id = 2}, 
    new {Id = 3}
};
items.ToDictionary(item => item.Id);

This example attempts to create a dictionary of anonymous types using the Id property as a key, but we have a duplicate, so we get an exception.

What are our options?

Well, it depends on what you need. Perhaps what you really want is a dictionary where the value contains every item with the given key. Or perhaps you only care about the first value for a given key and want to ignore any others. In either case, the Enumerable.GroupBy method comes in handy.

In the following example, we use this method to group the items by Id. This results in a sequence of IGrouping elements, one for each Id. We can then take advantage of a second parameter of ToDictionary and simply grab the first item in the group.

items.GroupBy(item => item.Id)
  .ToDictionary(group => group.Key, group => group.First());

This feels sloppy to me. There is too much potential for this to cover up a latent bug. Why should the other items be ignored? Perhaps, as in my original example, it's perfectly normal to have more than one element for the key, and you should handle that properly. Instead of grabbing the first item from the group, we retrieve an array.

items.GroupBy(item => item.Id)
  .ToDictionary(group => group.Key, group => group.ToArray());

In this case, we end up with a dictionary of arrays.
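
For example, assuming we store that result in a variable (byId is just a hypothetical name for this sketch), indexing by a key now hands back every matching element instead of throwing:

var byId = items.GroupBy(item => item.Id)
  .ToDictionary(group => group.Key, group => group.ToArray());
var bothTwos = byId[2]; // an array containing both items with Id = 2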

UPDATE: Or, as Matt Ellis points out in the comments, you could use the Enumerable.ToLookup method. I should have known such a thing would exist. It's exactly what I need for my particular situation here.
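
For instance, here's a minimal sketch using the items array from the simplified example. An ILookup holds multiple values per key, and indexing a key that isn't present returns an empty sequence rather than throwing:

var lookup = items.ToLookup(item => item.Id);
var twos = lookup[2];      // both items with Id = 2
var missing = lookup[42];  // an empty sequence, no exception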

What if having more than one element with the same key is not expected and should throw an exception? Well, you could just use the normal ToDictionary method, since it will throw an exception. But that exception is unhelpful. It doesn't have the information we probably want. For example, you might want to know which key was already added, as the following demonstrates:

items.GroupBy(item => item.Id)
    .ToDictionary(group => group.Key, group =>
    {
        try
        {
            return group.Single();
        }
        catch (InvalidOperationException)
        {
            throw new InvalidOperationException(
                "Duplicate item with the key '" + group.First().Id + "'");
        }
    });

In this example, if a key has more than one element associated with it, we throw a more helpful exception message.

System.InvalidOperationException:
Duplicate item with the key '2'

In fact, we can encapsulate this into our own better extension method.

public static Dictionary<TKey, TSource>
  ToDictionaryBetter<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
  return source.GroupBy(keySelector)
    .ToDictionary(group => group.Key, group =>
    {
      try
      {
        return group.Single();
      }
      catch (InvalidOperationException)
      {
        throw new InvalidOperationException(
          string.Format("Duplicate item with the key '{0}'",
            keySelector(group.First())));
      }
    });
}
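
Calling it looks just like the original call (assuming the method lives in a static class so extension method syntax kicks in), but a duplicate now fails with the clearer message shown above:

items.ToDictionaryBetter(item => item.Id);
// throws System.InvalidOperationException: Duplicate item with the key '2'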

Code mine mitigated!

This is just one example of a potential code mine that might go unnoticed during a code review if you’re not careful.

Now, when I review code and see a call to ToDictionary, I make a mental note to verify the assumption that the key selector must never lead to duplicates.

When I write such code, I’ll use one of the techniques I mentioned above to make my intentions more clear. Or I’ll embed my assumptions into the code with a debug assert that proves that the items cannot have a duplicate key. This makes it clear to the next reviewer that this code will not break for this reason. This code still might not open the hatch, but at least it won’t have a duplicate key exception.
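
The assert version might look something like this rough sketch, reusing the repository example from earlier (GetRepositories and CloneUrl as before, with System.Diagnostics in scope):

var repositories = GetRepositories().ToList();
// Document the assumption: no two repositories share a CloneUrl.
Debug.Assert(
  repositories.Select(r => r.CloneUrl).Distinct().Count() == repositories.Count,
  "Duplicate CloneUrl found; the ToDictionary call below will throw.");
var map = repositories.ToDictionary(r => r.CloneUrl);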

If I search through my code, I will find many other examples of potential code mines. What are some examples that you can think of? What mines do you look for when reviewing code?

Comments

26 responses

  1. Matt Ellis June 17th, 2013

    Perhaps not the point of the post, but, well, actually, Enumerable.ToLookup.

  2. Robert Schroeder June 17th, 2013

    mutability -- is it necessary, how to limit it, how to ensure it is valid.
    unintended state -- can the stateful variables be set to invalid combinations (if a == true, b SHOULD be x, y, or z, but in some cases could be NULL).

  3. Timothy Boyce June 17th, 2013

    Lots of things.. first one that comes to mind is modifying a collection while enumerating it.

  4. haacked June 17th, 2013

    How did I not know about this!? Thanks! I've updated the post.

  5. haacked June 17th, 2013

    Ah, but it doesn't look like exactly what I want. Unless I'm mistaken, it returns an ILookup which doesn't have a method to get the values for a key.

  6. Kern June 17th, 2013

    http://msdn.microsoft.com/e...
    "Properties:
    Item - Gets the collection of values indexed by the specified key."

  7. Matt Ellis June 17th, 2013

    Yeah. Very cool stuff. If I remember rightly, this will return an empty sequence if the key isn't found. I do love me some linq.

  8. Mauricio Scheffer June 17th, 2013

    I think you're mostly talking about partial functions here. There's tons of partial functions in the BCL, you just need to be aware of them and mostly avoid them, using total functions instead by default. For the particular case of ToDictionary, as others have said, ToLookup is similar but total. As its signature shows, it addresses the case of duplicate keys.
    And of course, it's not enough to be aware of existing partial functions, also be very wary of *producing* partial functions.

  9. Timothy Boyce June 17th, 2013

    What exactly do you mean by partial function? ToDictionary does exactly what I would expect it to do. Perhaps the exception could be more helpful when you give it bad data. If you really need a Dictionary, then ToLookup isn't going to work.

  10. haacked June 17th, 2013

    Ok, now I feel stupid. I should have read the docs more carefully.

  11. haacked June 17th, 2013

    Yeah, that's cool. So there's no need to do a TryGet.

  12. Mauricio Scheffer June 17th, 2013

    Definition of partial function: http://en.wikipedia.org/wik...

    The Haskell wiki has an easier definition for us programmers: http://www.haskell.org/hask...

    If you don't like ToLookup and want a dictionary as result, you could easily write a similar total function returning something like IReadOnlyDictionary<TKey, IReadOnlyCollection<TValue>>.

    What you refer to as "bad data" is an element for which ToDictionary is not defined. This is what makes ToDictionary a partial function. But if you change the result type (the function codomain), then you can make the function total, i.e. it will be defined for all values in the domain. It's not the data's fault, it's ToDictionary's fault.

    I briefly touched on the subject of partial functions in this blog post: http://bugsquash.blogspot.c...

  13. Leyu June 18th, 2013

    Typo on second snippet
    //items.ToDictionary(item => item.Key);
    items.ToDictionary(item => item.Id);

  14. FennNaten June 18th, 2013

    The lack of null check... I think the wrong assumption I see the most often in code is "this function which returns an object will never return null". Especially with linq. And in most of the cases, it really shouldn't. But it does. If I take your example:

    IEnumerable<Repository> repositories = GetRepositories();
    repositories.ToDictionary(r => r.CloneUrl);

    Here it depends on where GetRepositories comes from and how much you trust the source.
    Because I've seen too many people doing stuff like that:

    public IEnumerable<Repository> GetRepositories()
    {
        try
        {
            // nothing should go wrong here, but we will try/catch, just in case
            return GetAllRepositoriesWhichCanBeNullToo()
                .Where(x => SomeStuffWithAPotentialUnseenCrash(x) == true);
        }
        catch (Exception ex)
        {
            // we don't know how to return an empty enumerable
            // nevermind, that will never happen anyway. let's just return null! And no need to write it in the assembly doc!
            // And don't rethrow any meaningful exception cause we promised crash-free code to our client.
            return null;
        }
    }

    Another big classic: "this will never be 0, let's divide".
    In the series of nasty ones: the time-related, like "nobody will ever run that in another timezone, let's use date as part of the key".

  15. haacked June 18th, 2013

    Thanks! Fixed

  16. Mauricio Scheffer June 18th, 2013

    On second thought, it's not necessary to change the codomain to make it total: you could just keep the first of the duplicate keys. Of course, this would lead to data loss, which sometimes you'd want and sometimes not, but it's not as general as keeping all the data and then deciding whether to drop it or not.

  17. haacked June 18th, 2013

    Yeah, I hinted at that, but I don't find it satisfying. If you drop values, chances are you are hiding a bug.

    I'd prefer an exception, or the ToLookup approach that makes it clear there could be more than one value.

  18. Narayana June 18th, 2013

    Thanks Phil, looking forward to more posts like this :)

  19. Torleif Berger June 19th, 2013

    I'd say the problem in your example here is returning null, not the skipping of the null check.

  20. FennNaten June 19th, 2013

    Of course, returning null in GetRepositories is a problem, I don't deny that. My example is, on purpose, full of issues ;)
    If GetRepositories is your own code, of course you should fix it.

    But I was saying that the GetRepositories can be located in a vendor's dll (for which you don't have the source code or can't modify it).

    If you're calling any vendor code to get an object reference, in my opinion a null check and throwing a proper exception on null are mandatory before calling any property of the returned object; otherwise you're making the assumption that "the dll code will always behave as expected". Which can be ok if the code is from a 100% trusted source, sure, but you can also (unfortunately) have people who write the kind of code I give as an example ^^'

  21. Torleif Berger June 19th, 2013

    My issue with that is that a mindset of always checking for null results in bloated and ugly code. So I'd rather skip the check (as long as the method doesn't say explicitly that null is a valid return value, of course) and have it crash on me. And then, if it crashes, fix the method itself if it's in my control, or report it as a bug if that's possible. If it's not possible to get the library fixed, *then* add the check and handle it appropriately.

    This is based on my current work with Java in a project where null checks have been thrown all over the place, "just in case", which makes it annoying to read the code. Especially since even when the null check is there for a good reason (the method can return null), the method shouldn't really return null but instead throw an exception. I'm more for fixing the underlying problem than just hiding it behind a check and ignoring it. If we all do this, then those who produce code that returns null in stupid cases will never have a reason to stop. If we annoy them with bug reports etc., maybe the world will be a little bit better in a while :p

  22. Timothy Boyce June 19th, 2013

    That's what I'm thinking too. If there are duplicate keys and you think you need a Dictionary, then you probably don't really understand the data. If the data is correct, then your code is probably wrong. Just switching to using a Lookup probably isn't going to make it right.

  23. FennNaten June 19th, 2013

    I completely understand your point, and I agree, null checks all over the place are a real pain.
    An option my team took in a previous job, where we were clients of a lot of external libraries, was to use unit-test-like assertions.
    For example it would go like this:

    IEnumerable test = vendorCode.GetElements();
    Require(test).NotNull("optional message"); //this throws if condition not met

    /*do stuff with test*/

    I find it to be the least painful option: code stays readable and your intent is clear. Plus it forces you to always weigh your options while coding: "ok, what values can this method return and what are the requirements for me to be able to use a value? Should I do something if the value's not a usable one? Let's state it."
    But that doesn't prevent you from annoying the responsible people with bug reports though ;)

  24. haacked June 19th, 2013

    Who cares if the code is less bloated if it crashes on a user? My rule of thumb is document your assumptions. If you believe a method should not return null, then Debug.Assert that. Or, if you really don't care if the result is null, use the null coalescing operator. That is pretty clean and not bloated.

  25. Mauricio Scheffer June 20th, 2013

    The problem with exceptions is they don't compose, and partial functions lead to these "code mines". See for example http://blogs.atlassian.com/... and https://blogs.atlassian.com... .

    If you want a Dictionary<Key, Value> as result, you could also make it total by returning something like Either<string, Dictionary<Key, Value>> where the string contains the error message. This forces you to deal with the case of failure, as opposed to the exception.

  26. Mauricio Scheffer June 20th, 2013

    My point is about avoiding partial functions, not about understanding data or not. If the input type allows some "unwanted data", then the function must deal with this. Please see my reply to Phil.