A Subtle Case Sensitivity Gotcha with Regular Expressions

Feb 29, 2016 regex

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski

Other people, when confronted with writing a blog post about regular expressions, think “I know, I’ll quote that Jamie Zawinski quote!”

It’s the go-to quote about regular expressions, but it’s probably no surprise that it’s often taken out of context. Back in 2006, Jeffrey Friedl tracked down the original context of this statement in a fine piece of “pointless” detective work. The original point, as you might guess, is a warning against shoehorning regular expressions into problems they’re not appropriate for.

As XKCD noted, regular expressions used in the right context can save the day!

XKCD - CC BY-NC 2.5 by Randall Munroe

If Jeffrey Friedl’s name sounds familiar to you, it’s probably because he’s the author of the definitive book on regular expressions, Mastering Regular Expressions. After reading this book, I felt like the hero in the XKCD comic, ready to save the day with regular expressions.

The Setup

This particular post is about a situation where Jamie’s regular expressions prophecy came true. In using regular expressions, I discovered a subtle, unexpected behavior that could have led to a security vulnerability.

To set the stage, I was working on a regular expression to test to see if potential GitHub usernames are valid. A GitHub username may only consist of alphanumeric characters. (The actual task I was doing was a bit more complicated than what I’m presenting here, but for the purposes of the point I’m making here, this simplification will do.)

For example, here’s my first take at it: ^[a-z0-9]+$. Let’s test this expression against the username shiftkey (a fine co-worker of mine). Note that these examples assume you import the System.Text.RegularExpressions namespace in C# like so: using System.Text.RegularExpressions; You can run these examples online using CSharpPad; just be sure to output the statement to the console. Or you can use RegexStorm.net to test out the .NET regular expression engine.

Regex.IsMatch("shiftkey", "^[a-z0-9]+$"); // true

Great! As expected, shiftkey is a valid username.

You might be wondering why GitHub restricts usernames to the Latin alphabet a-z. I wasn’t around for the initial decision, but my guess is that it protects against confusing lookalikes. For example, someone could use a character that looks like an i and make me think they are shiftkey when in fact they are shıftkey. Depending on the font or whether someone is in a hurry, the two could be easily confused.
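To make the lookalike concrete, here is a minimal JavaScript sketch (not part of the original tests) that prints the Unicode code points of the four i-like letters that come up in this post. The helper name codePointLabel is made up for illustration.

```javascript
// The four "i" letters, written as escapes so they cannot be confused.
const iLetters = [
  '\u0069', // i  (Latin small letter i)
  '\u0049', // I  (Latin capital letter I)
  '\u0131', // ı  (Latin small letter dotless i)
  '\u0130', // İ  (Latin capital letter I with dot above)
];

// Hypothetical helper: format a character as a label such as "U+0069".
function codePointLabel(ch) {
  return 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
}

for (const ch of iLetters) {
  console.log(ch, codePointLabel(ch));
}
// i U+0069
// I U+0049
// ı U+0131
// İ U+0130
```

The dotless ı and the dotted İ sit outside the ASCII range entirely, which is why a plain [a-z] class should never accept them.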

So let’s test this out.

Regex.IsMatch("shıftkey", "^[a-z0-9]+$"); // false

Ah good! Our regular expression correctly identifies that as an invalid username. We’re golden.

But no, we have another problem! Usernames on GitHub are case insensitive!

Regex.IsMatch("ShiftKey", "^[a-z0-9]+$"); // false, but this should be valid

Ok, that’s easy enough to fix. We can simply supply an option to make the regular expression case insensitive.

Regex.IsMatch("ShiftKey", "^[a-z0-9]+$", RegexOptions.IgnoreCase); // true

Ahhh, now harmony is restored and everything is back in order. Or is it?

The Subtle Unexpected Behavior Strikes

Suppose our resident shiftkey imposter returns again.

Regex.IsMatch("ShİftKey", "^[a-z0-9]+$", RegexOptions.IgnoreCase); // true, DOH!

Foiled! Well that was entirely unexpected! What is going on here? It’s the Turkish İ problem all over again, but in a unique form. I wrote about this problem in 2012 in the post The Turkish İ Problem and Why You Should Care. That post focused on issues with Turkish İ and string comparisons.

The tl;dr summary is that the uppercase for i in English is I (note the lack of a dot) but in Turkish it’s dotted, İ. So while we have two i’s (upper and lower), they have four.
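As a rough illustration of those four letters, here is a small JavaScript sketch comparing the default case mapping with the Turkish one. It assumes a runtime with ICU locale support (the case for standard Node.js builds and modern browsers). Without it, the locale-aware calls fall back to the default mapping.

```javascript
// Default (locale-independent) mapping: the English pairing i <-> I.
const upperDefault = 'i'.toUpperCase();              // 'I' (U+0049)
const lowerDefault = 'I'.toLowerCase();              // 'i' (U+0069)

// Turkish mapping: dotted pairs with dotted, dotless with dotless.
// Assumes the runtime honors the 'tr-TR' locale for case conversion.
const upperTurkish = 'i'.toLocaleUpperCase('tr-TR'); // 'İ' (U+0130)
const lowerTurkish = 'I'.toLocaleLowerCase('tr-TR'); // 'ı' (U+0131)

console.log(upperDefault, lowerDefault, upperTurkish, lowerTurkish);
```

With that mapping in mind, back to the .NET match above.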

This feels like a bug to me, but I’m not entirely sure. It’s definitely a surprising and unexpected behavior that could lead to subtle security vulnerabilities. I tried this with a few other languages to see what would happen. Maybe this is totally normal behavior.

Here’s the regular expression literal I’m using for each of these test cases: /^[a-z0-9]+$/i. The key thing to note is that the /i at the end is a regular expression option that specifies a case insensitive match. Starting with JavaScript:

/^[a-z0-9]+$/i.test('ShİftKey'); // false

The same with Ruby. Note that the double negation is to force this method to return true or false rather than nil or a MatchData instance.

!!/^[a-z0-9]+$/i.match("ShİftKey")  # false

And just for kicks, let’s try Zawinski’s favorite language, Perl.

if ("ShİftKey" =~ /^[a-z0-9]+$/i) {
  print "true";    
}
else {
  print "false"; # <--- Ends up here
}

As I expected, these did not match ShİftKey but did match ShIftKey, contrary to the C# behavior. I also tried these tests with my machine set to the Turkish culture just in case something else weird was going on.

It seems like .NET is the only one that behaves in this unexpected manner. Though to be fair, I didn’t conduct an exhaustive experiment of popular languages.
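To recap the non-.NET behavior in one place, here is a JavaScript sketch that runs the usernames from this post through the same simplified, case-insensitive pattern. The variable names are only for illustration, and the pattern is the simplified one from this post, not GitHub's actual validation rule.

```javascript
// Simplified pattern from this post, with the case-insensitive flag.
const usernamePattern = /^[a-z0-9]+$/i;

// Usernames tried in this post; escapes keep the lookalikes unambiguous.
const candidates = [
  'shiftkey',       // all lowercase ASCII
  'ShiftKey',       // mixed-case ASCII
  'sh\u0131ftkey',  // contains ı (dotless lowercase i)
  'Sh\u0130ftKey',  // contains İ (dotted capital I)
];

// Pair each candidate with its match result.
const results = candidates.map((name) => [name, usernamePattern.test(name)]);

for (const [name, matched] of results) {
  console.log(name, matched);
}
// shiftkey true
// ShiftKey true
// shıftkey false
// ShİftKey false
```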

The Fix

Fortunately, in the .NET case, there are two simple ways to fix this.

Regex.IsMatch("ShİftKey", "^[a-zA-Z0-9]+$"); // false
Regex.IsMatch("ShİftKey", "^[a-z0-9]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant); // false

In the first case, we just explicitly specify capital A through Z and remove the IgnoreCase option. In the second case, we use the CultureInvariant regular expression option.

Per the documentation,

By default, when the regular expression engine performs case-insensitive comparisons, it uses the casing conventions of the current culture to determine equivalent uppercase and lowercase characters.

The documentation even notes the Turkish I problem.

However, this behavior is undesirable for some types of comparisons, particularly when comparing user input to the names of system resources, such as passwords, files, or URLs. The following example illustrates such as scenario. The code is intended to block access to any resource whose URL is prefaced with FILE://. The regular expression attempts a case-insensitive match with the string by using the regular expression $FILE://. However, when the current system culture is tr-TR (Turkish-Turkey), “I” is not the uppercase equivalent of “i”. As a result, the call to the Regex.IsMatch method returns false, and access to the file is allowed.

It may be that the other regular expression engines are culturally invariant by default when ignoring case. That seems like the correct default to me.
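The first fix doesn't rely on any engine option, so it carries over to other languages unchanged. Here is a minimal JavaScript sketch of that approach. The function name isValidUsername is hypothetical, and the pattern is still the simplified one from this post.

```javascript
// Hypothetical helper: ASCII letters and digits only. There is no
// case-insensitive flag, so the engine never applies any case mapping.
function isValidUsername(name) {
  return /^[a-zA-Z0-9]+$/.test(name);
}

console.log(isValidUsername('ShiftKey'));      // true
console.log(isValidUsername('Sh\u0130ftKey')); // false (contains İ)
console.log(isValidUsername('sh\u0131ftkey')); // false (contains ı)
```

Spelling out both ranges is more verbose than a flag, but the set of accepted characters is exactly what appears in the pattern.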

While writing this post, I used several online utilities to test the regular expressions in multiple languages.

Useful online tools

https://repl.it/languages provides a REPL for multiple languages such as Ruby, JavaScript, C#, Python, Go, and LOLCODE among many others.
http://www.tutorialspoint.com/execute_perl_online.php is a Perl REPL since that last site did not include Perl.
http://regexstorm.net/tester is a regular expression tester that uses the .NET regex engine.
https://regex101.com/#javascript allows testing regular expressions using PHP, JavaScript, and Python engines.
http://rubular.com/ allows testing using the Ruby regular expression engine.


Comments

18 responses

D.R. • February 29th, 2016
> It may be that the other regular expression engines are culturally invariant by default when ignoring case. That seems like the correct default to me.

Why so? Unicode defines sets for uppercase/lowercase letters, and I'm glad that .NET has a regular expression engine which respects Unicode wherever it can.

It is questionable, however, that a-z includes a turkish i as well. I guess, a good way is to not use A-Z, a-z or 0-9 altogether, and instead use \p{Lu}, \p{Lt}, \d. Nobody would expect that a lowercase turkish i would not match here.

Loved Friedl's book! Recommended for every programmer!
Rick Dailey • February 29th, 2016
To me, the very notion of case-sensitivity means you should specify a set of culture rules to evaluate that insensitivity. In my opinion, the real WTF is that any of these languages allow you to do this without an exception if you specify case-insensitive without a culture. Remember, even the "Invariant" culture is just "en" without "-us" or "-uk" at the end.

So by not specifying any culture, you're basically saying "anything goes" is my guess and since Turkish İ in your example does not fit into your local rules, it's an unbounded search for conversion of that character in any culture. I suspect a Turkish developer might think this is perfectly sane default behavior.
haacked • February 29th, 2016
> So by not specifying any culture, you're basically saying "anything goes" is my guess

Well, by not specifying a culture, you're saying the current culture with .NET, and invariant culture for other platforms.

Both of these are well specified behaviors. I'm taking issue with the default. I think it should be invariant.

> I suspect a Turkish developer might think this is perfectly sane default behavior.

I'm not so sure. I would imagine that they'd expect this to return true, but it returns false, no?

Regex.IsMatch("shıftkey", "^[a-z0-9]+$"); // false

Also, consider that my current culture is en-US in which case `İ` is not the capital of i so why is it matching?
Eric Falsken • February 29th, 2016
As mentioned below, the real solution is to not rely on `[a-zA-Z]` but to use character classes that encompass entire ranges of unicode character sets. `\w` is a much more reliable test since it matches any valid word character in any language. Even in Perl, you should turn on Unicode rules, but even without it, you can use `\p{Word}`.
Rick Dailey • February 29th, 2016
Ah, I understand. You're saying it should be Invariant because people who have been slinging Regex for a long time would expect it to do so, rather than sticking to the global .NET convention everything else uses CurrentCulture (such as ToLower() and StartsWith()). Seems like Regex should take in a CultureInfo as an overload, just like those methods do instead of that silly enum flag.

> Also, consider that my current culture is en-US in which case `İ` is not the capital of i so why is it matching?

("İ".ToLower(new CultureInfo("en-US")) == "i") // true

Unfortunately, I could not find a reference source, but perhaps because in US English a dotted İ is meaningless so the most useful behavior for English speakers would be to just treat it as a regular capital-I? I got nothing on that one.
haacked • February 29th, 2016
Except my task here is to explicitly only match a-z. \w would match too much.
Rick Dailey • February 29th, 2016
And to further clarify what I'm driving at, even in the default culture this works:

Regex.IsMatch("İ", "^[A-Za-z0-9]+$") // false

It's just that pesky IgnoreCase that screws stuff up, not [a-z] character class in and of itself.
Rob Head • March 1st, 2016
What do you expect this to return?

char.ToLower('İ', new CultureInfo("en-us"))

As 'İ' isn't part of the en-us alphabet I was assuming that it would return 'İ' (as the documentation implies) but it actually returns 'i'. That really doesn't seem expected or correct to me but cultures and time zones are basically the most confusing things in programming!

Edit: I should have mentioned that looking at Regex source the next char is returned using Char.ToLower when using the case insensitive option.
Rob Head • March 1st, 2016
Although this isn't about regular expressions, this does imply some extra rules: https://msdn.microsoft.com/...

If a best fit fallback is being used that would explain ToLower("İ") being mapped to "i".
Neil MacMullen • March 1st, 2016
Hi Phil, you might be interested in my regex tool www.textdistil.com. I often use it for same kind of scenario you're describing. You can enter a whole document of test input and then get a real-time set of matches as you tweak the regex. Obviously it's not a substitute for proper unit tests but is very helpful when trying to explore the problem space.
haacked • March 1st, 2016
> Ah, I understand. You're saying it should be Invariant because people who have been slinging Regex for a long time would expect it to do so, rather than sticking to the global .NET convention everything else uses CurrentCulture (such as ToLower() and StartsWith()). Seems like Regex should take in a CultureInfo as an overload, just like those methods do instead of that silly enum flag.

Yes! Exactly. It "sort of" follows the convention of strings, but regex is a different beast with a long history and idioms of its own.

I mean, what I'd really love is a regex literal in C#, but not sure if that will ever happen. :)
haacked • March 1st, 2016
> It is questionable, however, that a-z includes a turkish i as well.

But it doesn't include it. Check this out:

Console.WriteLine(Regex.IsMatch("shıftkey", "^[a-z]+$"));
Console.WriteLine(Regex.IsMatch("shıftkey", "^[a-z]+$", RegexOptions.CultureInvariant));
Console.WriteLine(Regex.IsMatch("shıftkey", "^[a-z]+$", RegexOptions.IgnoreCase));
Console.WriteLine(Regex.IsMatch("shıftkey", "^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant));

All of those return false!

> Loved Friedl's book! Recommended for every programmer!

:thumbsup:
Wayne • March 1st, 2016
a-z doesn't include turkish İ but it includes regular i which is also the lower-case form of the turkish capital İ. So when you ask for case-insensitive match of i you will successfully match İ.
@KernelCoreDump • March 2nd, 2016
.NET's RegEx case-insensitive match seems to make use of locale-dependent caseless matching because of the internal use of ToLower(), unless the CultureInvariant option is used.

I'd agree that the CultureInvariant option should be the default. Looks like the other languages do that. Wasn't it part of the TextInfo https://msdn.microsoft.com/... guidelines to use Invariant matching unless explicitly overridden to make use of the current locale?

The Unicode standard itself provides a locale-independent (Invariant) table for case folding, but leaves off at describing any locale-dependent case folding.
@KernelCoreDump • March 2nd, 2016
The documentation itself is incomplete. It states "The lowercase equivalent of c, modified according to culture, or the unchanged value of c, if c is already lowercase or not alphabetic."

That last phrase "if c is already lowercase or not alphabetic" should be qualified, with either one of the following:
- "... lowercase or not alphabetic in the target culture"
- "... lowercase or not alphabetic according to Unicode standards"

I also don't fully get what "modified according to culture" means. Would the other direction char.ToUpper('i', new CultureInfo("tr-TR")) be 'I' or 'İ'?
George Pollard • March 23rd, 2016
There's another gotcha in your regex that you missed: '$' allows newlines before the end of the string. You can use '\z' instead to disallow this (note that '\Z' is equivalent to '$'):

Regex.IsMatch("shiftkey\n", "^[a-zA-Z0-9]+$") // true
Regex.IsMatch("shiftkey\n", @"^[a-zA-Z0-9]+\z") // false
haacked • March 23rd, 2016
Really nice catch! The example I provided was simplified for the sake of demonstration and the actual regex for usernames is more complicated than this and doesn't have the same flaw.

Having said that, I will fix up this example and you just gave me an idea for a follow-up blog post! Thanks! :)
Cyril • January 18th, 2020
Hi, another online regex tester with a regex visualizer: https://extendsclass.com/regex-tester.html