The Turkish İ Problem and Why You Should Care

Take a look at the following code.

const string input = "interesting";
bool comparison = input.ToUpper() == "INTERESTING";
Console.WriteLine("These things are equal: " + comparison);
Console.ReadLine();

Let’s imagine that input is actually user input or some value we get from an API. That’s going to print out These things are equal: True right? Right?!

Well not if you live in Turkey. Or more accurately, not if the current culture of your operating system is tr-TR (which is likely if you live in Turkey).

To prove this to ourselves, let’s force this application to run using the Turkish locale. Here’s the full source code for a console application that does this.

using System;
using System.Globalization;
using System.Threading;

internal class Program
{
    private static void Main(string[] args)
    {
        // Force the current thread to use the Turkish (Turkey) culture.
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
        const string input = "interesting";

        // ToUpper() uses the current culture, so under tr-TR each "i" becomes "İ".
        bool comparison = input.ToUpper() == "INTERESTING";

        Console.WriteLine("These things are equal: " + comparison);
        Console.ReadLine();
    }
}

Now we’re seeing this print out These things are equal: False.

To understand why this is the case, I recommend reading one of the much more detailed treatments of this topic.

The tl;dr summary is that the uppercase for i in English is I (note the lack of a dot), but in Turkish it’s dotted, İ. So while we have two i’s (upper and lower), they have four: dotted i and İ, plus dotless ı and I.
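
To make the four i’s concrete, here’s a small sketch that prints the code points each culture produces. The class and helper names are made up for illustration, and it assumes a runtime that has Turkish culture data available (it won’t behave this way in invariant globalization mode).

```csharp
using System;
using System.Globalization;
using System.Linq;

internal static class TurkishCasingDemo
{
    // Hypothetical helper: formats each character as its code point, e.g. "U+0130".
    internal static string CodePoints(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    private static void Main()
    {
        var turkish = new CultureInfo("tr-TR");
        var invariant = CultureInfo.InvariantCulture;

        // Uppercasing a dotted lowercase i.
        Console.WriteLine(CodePoints("i".ToUpper(invariant))); // U+0049 (I)
        Console.WriteLine(CodePoints("i".ToUpper(turkish)));   // U+0130 (İ)

        // Lowercasing a dotless uppercase I.
        Console.WriteLine(CodePoints("I".ToLower(invariant))); // U+0069 (i)
        Console.WriteLine(CodePoints("I".ToLower(turkish)));   // U+0131 (ı)
    }
}
```

Passing the culture explicitly, rather than setting it on the thread, keeps the demo from depending on the machine it runs on.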

My app is English only. AMURRICA!

Even if you have no plans to translate your application into other languages, your application can be affected by this. After all, the sample I posted is English only.

Perhaps there aren’t going to be that many Turkish folks using your app, but why subject the ones that do to easily preventable bugs? If you don’t pay attention to this, it’s very easy to end up with a costly security bug as a result.

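The solution is simple. In most cases, when you compare strings, you want to compare them using StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase. The catch is that there are so many ways to compare strings. It’s not just String.Equals: String.Compare, StartsWith, EndsWith, IndexOf, and the ToUpper trick above all use the current culture when you call them with just strings and no explicit StringComparison or culture.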
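
Here’s roughly what that fix looks like for the sample above. Treat it as a sketch: IsInteresting is a hypothetical helper I made up for illustration, not something from a real code base.

```csharp
using System;
using System.Globalization;
using System.Threading;

internal static class OrdinalComparisonDemo
{
    // Hypothetical helper: a case-insensitive check that ignores the current culture.
    internal static bool IsInteresting(string input)
    {
        return string.Equals(input, "INTERESTING", StringComparison.OrdinalIgnoreCase);
    }

    private static void Main()
    {
        // Simulate a user whose operating system is set to Turkish.
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

        // Culture-sensitive uppercasing: affected by the Turkish casing rules.
        Console.WriteLine("ToUpper comparison: " + ("interesting".ToUpper() == "INTERESTING"));

        // Ordinal comparison: the same answer under every culture.
        Console.WriteLine("OrdinalIgnoreCase comparison: " + IsInteresting("interesting"));
    }
}
```

If you really do need a normalized copy of the string rather than a yes/no answer, ToUpperInvariant() is the culture-independent counterpart to ToUpper().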

Code Analysis to the rescue

I’ve always been a fan of FxCop. At times it can seem to be a nagging nanny constantly warning you about crap you don’t care about. But hidden among all those warnings are some important rules that can prevent some of these stupid bugs.

If you have the good fortune to start a project from scratch in Visual Studio 2010 or later, I highly recommend enabling Code Analysis (FxCop has been integrated into Visual Studio and is now called Code Analysis). My recommendation is to pick a set of rules you care about and make sure that the build breaks if any of the rules are broken. Don’t turn them on as warnings because warnings are pointless noise. If it’s not important enough to break the build, it’s not important enough to add it.

Of course, many of us are dealing with existing code bases that haven’t enforced these rules from the start. Adding in code analysis after the fact is a daunting task. Here’s an approach I took recently that helped me retain my sanity. At least what’s left of it.

First, I manually created a file with the following contents:

<?xml version="1.0" encoding="utf-8"?>
<RuleSet Name="PickAName" Description="Important Rules" ToolsVersion="10.0">
  <Rules AnalyzerId="Microsoft.Analyzers.ManagedCodeAnalysis"
      RuleNamespace="Microsoft.Rules.Managed">

    <Rule Id="CA1309" Action="Error" />    
  
  </Rules>
</RuleSet>

You could create one per project, but I decided to create one for my solution. It’s just a pain to maintain multiple rule sets. I named this file SolutionName.ruleset and put it in the root of my solution. The name doesn’t matter; just make the extension .ruleset.

I then configured each project that I cared about in my solution (I ignored the unit test project) to enable code analysis using this ruleset file. Just go to the project properties and select the Code Analysis tab.

I changed the selected Configuration to “All Configurations”. I also checked the “Enable Code Analysis…” checkbox. I then clicked “Open” and selected my ruleset file.

At this point, every time I build, Code Analysis runs only the one rule, CA1309. This way, adding more rules becomes manageable. Every time I fixed all the violations of a rule, I’d add that rule to this file, one rule at a time. I went through the following lists looking for important rules.

I didn’t add every rule from each of these lists, only the ones I thought were important.

Eventually I was including so many rules that it made sense to invert the list: rather than listing all the rules I wanted to include, I listed only the ones I wanted to exclude.

<?xml version="1.0" encoding="utf-8"?>
<RuleSet Name="PickAName" Description="Important Rules" ToolsVersion="10.0">
  <IncludeAll Action="Error" />
  <Rules AnalyzerId="Microsoft.Analyzers.ManagedCodeAnalysis"
      RuleNamespace="Microsoft.Rules.Managed">

    <Rule Id="CA1704" Action="None" />    
  
  </Rules>
</RuleSet>

Notice the IncludeAll element now makes every code analysis warning into an error, but then I turn CA1704 off in the list.

Note that you don’t have to edit this file by hand. If you open the ruleset in Visual Studio it’ll provide a GUI editor. I prefer to simply edit the file.

One other thing I did: for really important rules that had too many violations to fix in a timely manner, I simply used Visual Studio to suppress all the existing violations and committed that. At least that ensured that no new violations of the rule would be committed, and it allowed me to fix the existing ones at my leisure.
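
For reference, an in-source suppression looks something like the sketch below. LegacyLookup and IsAdmin are hypothetical names, and the justification text is just an example. Visual Studio can also write suppressions into a project-level suppression file instead of decorating each member.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;

internal static class LegacyLookup
{
    // Existing violation of CA1309, suppressed for now so the rule can still break the build on new code.
    [SuppressMessage("Microsoft.Globalization", "CA1309:UseOrdinalStringComparison",
        Justification = "Existing violation; to be fixed later.")]
    internal static bool IsAdmin(string userName)
    {
        // Hypothetical legacy comparison that the rule would otherwise flag.
        return string.Equals(userName, "admin", StringComparison.InvariantCultureIgnoreCase);
    }

    private static void Main()
    {
        Console.WriteLine(IsAdmin("ADMIN"));
    }
}
```

The attribute only ends up in the compiled assembly when the CODE_ANALYSIS symbol is defined, so it shouldn’t affect builds that don’t define it.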

I’ve found this approach makes using code analysis way more useful and less painful than simply turning on every rule and hoping for the best. Hope you find this helpful as well. May you never ship a bug with the Turkish I problem again!

Comments

23 responses

  1. John Gietzen July 5th, 2012

    Is there any guidance on which rules are actually important?
    For example, is signing your application with a strong name key even good advice?

  2. haacked July 5th, 2012

    @John great question. That's a rule I generally ignore unless I have some specific reason that requires it. It really depends on the application. For example, I ignore all the rules about COM because I don't care about COM interoperability. When that becomes a requirement, then maybe I'll turn them on.

  3. Rick July 5th, 2012

    Looks like no love for those of us slumming it with the $500 2010 Pro version.
    Ever notice how it's kinda a bummer to discover you're missing something you hadn't used before?

  4. Gareth July 5th, 2012

    I can't let you post this and not link to Jon Skeet's ranttalk on this and other problems.

  5. Fırat Esmer July 6th, 2012

    Thanks man! Nice article, I enjoyed it.

  6. Tim Murphy July 6th, 2012

    +1 for forcing warnings as errors. I am amazed how often I download code that has compiler warnings. I can never understand why people don't at the very least get annoyed by the warnings.

  7. Rikki Mongoose July 6th, 2012

    There's a famous story about Ramazan Çalçoban, who wrote his friend (and probably girlfriend) Emin an SMS that was received as "sikisince" ("they are fucking you") instead of "sıkısınca" ("run out of arguments"). The result: one murder, one suicide, and three people put in jail.

  8. Doeke July 6th, 2012

    I think the correct code would be to use "interesting".ToUpper() instead of "INTERESTING". But then again: your production example was probably a little different.
    But it's all in the context. I once wrote an SMS application for a Turkish cell phone provider. Once we learned that SMS messages with Turkish characters can only be about 40 characters long (instead of 140), we just used the ASCII representation. To make sure we didn't run into ambiguities (like the one Rikki mentioned), we had the texts reviewed by some Turkish-speaking colleagues.

  9. Dony van Vliet July 6th, 2012

    Too bad .NET does not have a standard method on strings for Unicode case folding. Adhering to the Unicode standard by using case FOLDING to compare strings in a case-insensitive way, instead of introducing both an ordinal and an invariant way of doing a case-insensitive comparison, will not wipe away all the differences between cultures, but it does implement a well-known and well-defined way of comparing strings without any regard to case. Notice that String.Normalize(NormalizationForm) IS part of the framework since .NET 2.0! Maybe it's time for a few extension methods on strings to fill in the gap ...

  10. DaveShaw July 6th, 2012

    @Rick, VS 2012 has Code Analysis in the Pro Version.

  11. Nando July 6th, 2012

    Thank you Phil for this little gem :)
    @Phil: can you share your RuleSet?
    I'm trying to be a more diligent developer using rules, but I don't want to use them all :)

  12. haacked July 6th, 2012

    @Doeke: I think the right approach is to use input.Equals("interesting", StringComparison.OrdinalIgnoreCase);
    @Nando: Well my ruleset turns off some rules I actually want turned on because we have too many warnings. I'll see if I can create one that's my ideal case. Also, it really depends on whether you're writing a library vs an app.

  13. Nando July 6th, 2012

    @Phil: yes, I feel the same: too many warnings over there... :) Thanks!

  14. Alexander Nyquist July 6th, 2012

    Well, it is an interesting problem with many side effects. Take a look at this "funny" bug thread on PHP: https://bugs.php.net/bug.php?id=18556.

  15. Mattias Larsson July 6th, 2012

    We ran into this problem some years ago when deploying our product in Turkey. The problem occurred in our database queries. All tables/fields started with a capital letter, and of course some of those happened to start with an "I". But some lazy programmers knew that the query parser (SQL in this case) wasn't case-sensitive, so a lot of queries were typed in lower case only. These queries failed miserably and it was quite hard to find them all...

  16. Alexander Nyquist July 6th, 2012

    Hmm, previous comment was marked as spam.
    This is actually an important problem which can cause many non-obvious side effects. For instance, see this entry on PHP's bug tracker: https://bugs.php.net/bug.php?id=18556. It will essentially break your whole codebase if you set the locale to tr_TR.
    It's quite hard to find problems like this so raising the awareness of it is great.

  17. Yakup İpek July 8th, 2012

    Most serious applications have serious Turkish localization bugs. I encountered one while using the Telerik MVC extensions. The problem was that the HTML output differed when I changed the culture to Turkish, because of i-I conversion problems.
    Here is the problem: stackoverflow.com/.... And here is a great article about the Turkish localization problem: www.moserware.com/... .
    We have a solution in .NET but not in JavaScript. String manipulation libraries like underscore.string.js, string.js, or any other do not have localization support, so we just have dirty solutions like the one here: stackoverflow.com/... .

  18. Dennis Doomen July 10th, 2012

    @John As part of maintaining my own C# Coding Guidelines (www.csharpcodingguidelines.com), I've also defined a set of Visual Studio 2010 rule sets with a distinction between different types of systems. In fact, the document is just an extension to those rule sets with guidelines that cannot (yet) be automatically checked.

  19. Alessandro Riolo February 6th, 2013

    The first .Net SDK I had under my hands (probably a beta of 1.0, but I don't really remember, it was surely more than 10 years ago), was incorrectly capitalizing the Turkish i as I. I did raise the issue with Microsoft at the time, and while they never answered me, the next time I looked at it, they had corrected the issue.
    Funnily, I had looked because I had exactly the same issue with a JDK some time earlier, and I had managed to get Sun to fix that too :)

  20. LordLiverpool March 21st, 2014

    Does the file system treat the dotless small-case i and the upper-case dotted I the same way? In other words, in Windows, does a *lower-case* file path containing a dotless small i match the same *upper-case* file path with a dotless upper-case I on Turkish systems, but not on other locales? (Sorry, it's confusing to explain)

    I have a working locale-sensitive case-changing function based on LCMapString, but I'm wondering what to do with bits of the code where I tend to standardise file paths by making them lower case (and then later comparing them).

  21. haacked March 21st, 2014

    Great question! If you have access to a Windows machine you should try it out.

  22. LordLiverpool March 24th, 2014

    OK I tried on a Turkish Windows machine, and as I suspected, it treats upper-case dotless I and lower-case dotless i as separate characters. I guess this is for compatibility. Imagine that locale-specific rules were used for casing; you create a folder called "i" and another folder called "I" on Turkish Windows, and that's OK because they're two separate characters. Now a colleague with German Windows tries to connect to your machine to access those folders, but he can't, because the names are identical according to Western casing rules.

    Interestingly, if you create a folder named upper-case Greek sigma (Σ), even non-Greek Windows won't let you create a folder named with either of the two lower-case sigma characters (ς, σ), suggesting that it converts to upper case and compares.

    The conclusion: locale-specific case changing is needed when you're dealing with end-user text, but file-system operations should continue to use the ordinary case-change functions.

  23. Kresten Birkegaard Gregersen September 9th, 2014

    Hi and thanks for the article.
    I stumbled on it while investigating an issue with our asp.net web application, and thought I would share my findings.
    The issue only occurred when the language was set to tr-TR.
    The issue is in regards to accessing data in a DataRowView in a DataView. Here is a simple version of the code:
    DB:
    select 1 as index --note the lower case i

    C#:
    foreach (DataRowView r in dv)
    {
    return r.Row["Index"]; //note the capital I
    }

    The above works for en-GB and others, but not for tr-TR. Changing the i/I in the DB or c# so the case matches resolves the issue.
    Unfortunately there does not seem to be a way to control the way .Row["xxxx"] does its string comparison, which I guess it must do, otherwise it should not work for en-GB either...