Mystery of The French Thousands Separator

I enjoy writing silly chat bots. To indulge my silliness, I’ve been exploring the Microsoft Bot Framework. Overall, it’s a pretty good framework, but I’ve had some weird bugs here and there. It’s unclear to me if they’re my fault or not. So to dig into them, I cloned the microsoft/botbuilder-dotnet repository to my machine and ran all the unit tests. It’s what I do.

One of the tests failed with the message:

Assert.AreEqual failed. Expected:<12 000,3000>. Actual:<12 000,3000>.

Can you spot the difference?

It’s a little hard to see, so let’s write some code to take a closer look. I’ve posted the code on dotnetfiddle if you want to play with it.

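using System;
using System.Linq;
using System.Globalization;

public class Program
{
	public static void Main()
	{
		// Spell out the two separators explicitly so they survive copy and paste.
		char nbsp = (char)160;    // U+00A0 NO-BREAK SPACE (what the test expects)
		char nnbsp = (char)8239;  // U+202F NARROW NO-BREAK SPACE (what my machine produces)
		var expected = string.Join(",",
			$"12{nbsp}000,3000"
			.Select(c => (int)c));
		var actual = string.Join(",",
			$"12{nnbsp}000,3000"
			.Select(c => (int)c));
		// Print each string as a comma-separated list of character codes.
		Console.WriteLine(expected);
		Console.WriteLine(actual);
	}
}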

This results in:

49,50,160,48,48,48,44,51,48,48,48
49,50,8239,48,48,48,44,51,48,48,48

Would you look at that?!

The third character is different! In the expected string its value is 160, which translates to U+00A0 in Unicode, or what we would know as the nbsp (aka the NO-BREAK SPACE).

But on my machine, I get 8239 there, which is U+202F, the lesser-known cousin of nbsp: the nnbsp (aka NARROW NO-BREAK SPACE).
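Comparing lists of character codes by eye gets old fast. Here’s a minimal sketch of a helper that reports the first position where two strings differ. The names `StringDiff` and `Describe` are made up for illustration; they aren’t part of MSTest or the Bot Framework.

```csharp
using System;

public static class StringDiff
{
	// Hypothetical helper: describes the first differing UTF-16 code unit.
	public static string Describe(string expected, string actual)
	{
		int shared = Math.Min(expected.Length, actual.Length);
		for (int i = 0; i < shared; i++)
		{
			if (expected[i] != actual[i])
			{
				return $"index {i}: expected U+{(int)expected[i]:X4}, actual U+{(int)actual[i]:X4}";
			}
		}
		// No mismatch in the shared prefix, so only the lengths can differ.
		return expected.Length == actual.Length
			? "strings are identical"
			: $"lengths differ: expected {expected.Length}, actual {actual.Length}";
	}
}

public class Program
{
	public static void Main()
	{
		// Prints: index 2: expected U+00A0, actual U+202F
		Console.WriteLine(StringDiff.Describe(
			"12\u00A0000,3000", "12\u202F000,3000"));
	}
}
```

It compares UTF-16 code units rather than full code points, which is enough here because both separators sit in the Basic Multilingual Plane.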

I dug into it a little more and the code that’s being tested is formatting a number for the fr-FR locale. The space character there is determined by the NumberFormatInfo.CurrencyGroupSeparator property for the locale.


So I wrote a little code to test this out. Again, it’s on dotnetfiddle.

using System;
using System.Globalization;
					
public class Program
{
	public static void Main()
	{
		var separator = new CultureInfo("fr-FR", false)
			.NumberFormat
			.CurrencyGroupSeparator;
		
		Console.WriteLine(
			$"Currency Group Separator for fr-FR is {(int)separator[0]}");
		Console.WriteLine(
			$"As HEX {((int)separator[0]).ToString("X")}");
	}
}

Important note: I made sure to call the CultureInfo constructor that lets us ignore user-selected culture settings from the system. Otherwise this test might be flaky for those who have customized settings on their machine.

The result on my machine and in dotnetfiddle is:

Currency Group Separator for fr-FR is 8239
As HEX 202F

To be nice, I thought I’d fix the test and submit a PR. It’s the scouting rule. However, my PR failed the build. On the project’s machines, that group separator is 160. What gives? So I dug into it more and discovered the Unicode CLDR project.
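One way to sidestep this kind of mismatch is to stop hard-coding the separator and build the expected string from the same culture data the formatter uses. What follows is only a rough sketch of that idea, not the actual test from botbuilder-dotnet. `SeparatorAgnostic` and `BuildExpected` are hypothetical names, and I’m assuming a plain "N4" number format, which reads NumberGroupSeparator (a currency format would read CurrencyGroupSeparator instead).

```csharp
using System;
using System.Globalization;

public static class SeparatorAgnostic
{
	// Hypothetical helper: assembles "12<group>000<decimal>3000" from
	// whatever separators the runtime reports for the given culture.
	public static string BuildExpected(CultureInfo culture)
	{
		NumberFormatInfo format = culture.NumberFormat;
		return "12" + format.NumberGroupSeparator + "000"
			+ format.NumberDecimalSeparator + "3000";
	}
}

public class Program
{
	public static void Main()
	{
		// useUserOverride: false ignores customized settings on the machine.
		var culture = new CultureInfo("fr-FR", false);
		string actual = 12000.3m.ToString("N4", culture);

		// Should hold whether the group separator is U+00A0 or U+202F.
		Console.WriteLine(SeparatorAgnostic.BuildExpected(culture) == actual);
	}
}
```

The trade-off is that a test written this way no longer pins down which character the separator is. It only checks that the formatter agrees with the culture data on the machine it runs on.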

CLDR does not stand for “Certainly Long, Didn’t Read.” Rather, it’s the Unicode Common Locale Data Repository. It’s a project by the Unicode Consortium to provide locale data in an XML format.

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available.

It turns out that sometime in October 2018, the Unicode Consortium changed the thousands separator character for the French locale.

You can read it in their CLDR 34 release notes. And yes, it’s a bit TL;DR so I’ll quote the relevant section.

The French locale now uses narrow no-break space U+202F is [sic] several places: as the numeric grouping separator, in many short unit patterns, and in the locale display name patterns. It also changed normal space to no-break space U+00A0 in the wide unit patterns.

According to this JavaMoney issue…

And even more: they have some intentions to change this for other locales:

Unfortunately, all the TRAC links are broken so I couldn’t follow up to verify, but it seems reasonable.
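If the separator can differ between CLDR versions, code that compares or parses formatted numbers may want to treat the space-like separators as equivalent. Here’s a minimal sketch, assuming that collapsing them to an ordinary space is acceptable for your use case. `SpaceNormalizer` and `NormalizeSpaces` are names I made up.

```csharp
using System;

public static class SpaceNormalizer
{
	// Hypothetical helper: maps NO-BREAK SPACE (U+00A0) and
	// NARROW NO-BREAK SPACE (U+202F) to an ordinary space (U+0020).
	public static string NormalizeSpaces(string text)
	{
		return text.Replace('\u00A0', ' ').Replace('\u202F', ' ');
	}
}

public class Program
{
	public static void Main()
	{
		string older = "12\u00A0000,3000"; // separator before the CLDR change
		string newer = "12\u202F000,3000"; // separator after the CLDR change

		// Prints: True
		Console.WriteLine(
			SpaceNormalizer.NormalizeSpaces(older) == SpaceNormalizer.NormalizeSpaces(newer));
	}
}
```

This only handles the two characters from this post. Other locales may use other separators entirely.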

So what does this all mean? Programming is hard. And programming for multiple locales is even harder. Be safe out there.

Perhaps the biggest lesson is that any time you tell yourself “Oh, this’ll be a simple fix!” you’re probably wrong.

Comments

5 responses

Bertrand • May 18th, 2020
Duh. 😁
Iluvatar • May 20th, 2020
Good to know. I recently ran into a similar issue parsing numbers with locale-specific separators: my test data used a normal space as a separator instead of the special one, and I had no idea why it was failing. I feel like a test framework that showed the code difference automatically when a single character is different would be amazing.
Kristof • May 21st, 2020
This reminds me of when I was sorting pictures…

I get it with GetDetailsOf(file, 12) and then try to parse it with [System.DateTime]::Parse(date), just letting it take my current culture, which matched the way GetDetailsOf() returned things.

FormatException…

I looked at the data, and the Date format matched. What’s going on?

Tried a couple of things, like putting in my own format and such…

Didn’t work.

Eventually I started to look at the data that got returned vs a string that I generated myself… They didn’t match. The one returned by the system contained some 8206 and 8207.

Odd.

Putting that into Google showed me that I wasn’t the first person experiencing this: https://stackoverflow.com/a/18298371/162694

Filtering out those values made it work even without passing in a culture. Great!

Very weird that even the built-in new CurrentCulture("en-US") is incompatible with the way the shell returns.

Ah-well, lessons learned.
Chris • October 26th, 2021
Thanks for the write up! In Java/Scala/Kotlin this will hit you when upgrading from JDK8 to JDK11.
Chris • October 26th, 2021
UPDATE: for Java/Scala/Kotlin it was changed wrongly in JDK11-14.

And even worse: For de_AT the monetary grouping separator is a dot ‘.’, the number grouping separator is a blank ‘ ‘.

See https://bugs.openjdk.java.net/browse/JDK-8227313