Internationalized URLs


Despite an international team of committers to Subtext and the fact that MySpace China uses a customized version of Subtext for its blog, I am ashamed to say that Subtext’s support for internationalization has been quite weak.

True, I did once write that The Only Universal Language in Software is English, but I didn’t mean that English is the only language that matters, especially on the web.

One area that we need to improve is in dealing with international URLs. For example, if I’m a user in Korea, I should be able to write a post with a Korean domain and a Korean title and thus have a friendly URL like so:

http://하쿹.com/blog/안녕하십니까.aspx

(As an aside, roughly speaking, 하쿹 would be pronounced hah-kut. About as close as I can get to haacked which is pronounced like hackt.)

If you’re a kind soul, you will forgive us for punting on this issue for so long. After all, RFC 2396, which defines the syntax for Uniform Resource Identifiers (URIs), only allows for a subset of ASCII (about 60 characters).

But then again, I’ve been hiding behind this RFC as an excuse for a while, knowing full well that there are workarounds. I have just been too busy to fix this.

There are actually two issues here: the hostname (aka domain name), which is quite restrictive and, AFAIK, cannot be URL encoded, and the rest of the URL, which can be encoded.

The domain name issue is resolved by the diminutively named Punycode (described in RFC 3492). Punycode is an encoding that converts Unicode strings into the more limited set of ASCII characters allowed in network host names.

For example, http://你好.com/ translates to http://xn--6qq79v.com/ in Punycode.

Fortunately, this issue is pretty easy to fix. Since the browser is responsible for converting the Unicode domain name in the URL to Punycode, all we need to do in Subtext is allow users to set up a hostname that contains Unicode. We can then convert that to Punycode using something like the Punycode / IDN library for .NET 2.0. For this blog post, I used the web-based phlyLabs IDNA Converter for converting Unicode to Punycode.
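To make the hostname conversion concrete, here is a minimal sketch in Python rather than .NET. It leans on Python's built-in `idna` codec, which follows the older IDNA 2003 rules, so treat it as an illustration and not as what Subtext or the library mentioned above actually does. The function names `to_ascii_hostname` and `to_unicode_hostname` are made up for this example.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII (Punycode) form.

    The built-in "idna" codec handles each dot-separated label separately
    and adds the "xn--" prefix to labels that contain non-ASCII characters.
    """
    return hostname.encode("idna").decode("ascii")


def to_unicode_hostname(hostname: str) -> str:
    """Convert an ASCII (Punycode) hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(to_ascii_hostname("你好.com"))          # xn--6qq79v.com
    print(to_unicode_hostname("xn--6qq79v.com"))  # 你好.com
```

Plain ASCII hostnames pass through unchanged, so a conversion like this could be applied to every configured hostname without special-casing.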

The second issue is the rest of the URL. When you enter a title for a post in Subtext, we convert that to a human and URL friendly ASCII “slug”. For example, if you enter the title “I like lamp” for a blog post, Subtext creates the friendly URL ending with “i_like_lamp.aspx”.

We haven’t totally ignored international URLs. For international western languages, we have code that effectively replaces accented characters with a close ASCII equivalent. A couple of examples (there are more in our unit tests) are:

Åñçhòr çùè becomes Anchor_cue

Héllò wörld becomes Hello_world

Unfortunately for my Korean brethren, something like 안녕하십니까 becomes (empty string). Well that totally sucks!
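Here is a rough sketch of that kind of slug conversion, in Python. This is not Subtext's actual code: `make_slug` is a hypothetical helper, and it assumes that Unicode decomposition (NFKD) followed by dropping everything outside ASCII is a fair approximation of "replace accented characters with a close ASCII equivalent". It preserves case, as in the accented examples above, and it reproduces the empty result for Korean.

```python
import re
import unicodedata


def make_slug(title: str) -> str:
    """Build an ASCII-only slug from a post title (illustrative only)."""
    # NFKD splits an accented letter into a base letter plus combining marks,
    # e.g. "é" becomes "e" followed by U+0301.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop everything that is not ASCII: the combining marks disappear,
    # and so does any script with no ASCII base letters (such as Hangul).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of non-alphanumeric characters into one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_")


if __name__ == "__main__":
    print(make_slug("Åñçhòr çùè"))          # Anchor_cue
    print(make_slug("Héllò wörld"))         # Hello_world
    print(repr(make_slug("안녕하십니까")))   # ''
```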

The thing is, the simple solution in this case is to just allow the Unicode Korean word as the slug. Browsers will apply the correct URL encoding to the URL. Thus https://haacked.com/안녕하십니까/ would become a request for https://haacked.com/%EC%95%88%EB%85%95%ED%95%98%EC%8B%AD%EB%8B%88%EA%B9%8C/ and everything works just fine as far as I can tell. Please note that Firefox 2.0 actually replaces the Unicode string in the address bar with the encoded string, while IE7 displays the Unicode as-is but makes the request using the encoded URL (as confirmed by Fiddler).
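The encoding the browser applies here can be reproduced with a few lines of Python. This sketch assumes UTF-8 percent-encoding, which is what the encoded URL above reflects; `encode_path` and `decode_path` are just illustrative names.

```python
from urllib.parse import quote, unquote


def encode_path(path: str) -> str:
    """Percent-encode a URL path as UTF-8, leaving "/" separators intact."""
    return quote(path, safe="/", encoding="utf-8")


def decode_path(path: str) -> str:
    """Reverse encode_path, turning %XX sequences back into Unicode text."""
    return unquote(path, encoding="utf-8")


if __name__ == "__main__":
    encoded = encode_path("/안녕하십니까/")
    print(encoded)
    # /%EC%95%88%EB%85%95%ED%95%98%EC%8B%AD%EB%8B%88%EA%B9%8C/
    print(decode_path(encoded))  # /안녕하십니까/
```

Each Korean syllable takes three bytes in UTF-8, which is why a six-character slug balloons into eighteen `%XX` escapes.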

For western languages in which we can do a decent enough conversion to ASCII, the benefit is that the URL remains somewhat readable and “friendlier” than a long URL encoded string. But for non-western scripts, we have no choice but to deal with these ugly URL encoded strings (at least in Firefox).

The interesting thing is, when researching how sites in China handle internationalized URLs, I discovered that, just as we did, they simply punt on the issue. For example, http://baidu.com/, the most popular search engine in China last I checked, has English URLs.

Tags: URL, Localization, Punycode


Comments

12 responses

  1. Barry Kelly November 29th, 2007

    Don't forget the major spoofing problems with proper display of Unicode characters in URLs - character glyphs that look similar to normal Western ASCII characters, but have different code points. Not good.

  2. Julian November 29th, 2007

    You say that when a user types the URL in the address bar, browsers automatically apply the correct URL encoding. But as far as I have observed, they don't apply Unicode encoding, rather Latin-1, so it doesn't work if you type ü for example, which becomes %FC instead of %C3%BC. Any ideas?
    Wikipedia seems to convert it somehow and redirects you to the right page.

  3. Nathan November 29th, 2007

    What about phoneticizing the word? You mention above that 하ㅋㄷ has a phonetic equivalent which can be expressed in ASCII-range characters. I believe that most (all?) east-Asian languages have a fairly standard glyph/ideogram-to-phonetic-spelling mapping that could be applied. This would probably be more useful than a big nasty urlencoded string.

  4. Nicholas Paldino [.NET/C# MVP] November 29th, 2007

    If I remember my Korean correctly, Hangul requires consonant-vowel combinations, and what you wrote was the equivalent of writing some English consonant combinations which don't make sense on their own (like if you saw "tqk" on its own).
    Wikipedia seems to support this, if that matters at all:
    Syllabic blocks
    Except for a few grammatical morphemes in archaic texts, no letter may stand alone to represent elements of the Korean language. Instead, jamo are grouped into syllabic blocks of at least two and often three: (1) a consonant or consonant cluster called the initial (초성, 初聲 choseong syllable onset), (2) a vowel or diphthong called the medial (중성, 中聲 jungseong syllable nucleus), and, optionally, (3) a consonant or consonant cluster at the end of the syllable, called the final (종성, 終聲 jongseong syllable coda).
    A more correct phoneticization of "hacked" which makes more sense in Korean (IMO) is 하쿹

  5. Haacked November 29th, 2007

    @Nicholas Thanks for the correction. You're absolutely right.

  6. WaterBreath November 29th, 2007

    > Åñçhòr çùè becomes Anchor_cue
    As Barry Kelly pointed out, this seems very vulnerable to "homograph" URL spoofing attacks, as described here: http://www.icann.org/announ...
    Example: èbäŷ.com becomes ebay.com

  7. Haacked November 30th, 2007

    Well this is only for the "slug" not for the domain name. I wouldn't do that for the domain name.

  8. Lance Fisher November 30th, 2007

    I don't see the "homograph" threat when the url encoding is done on the folder names and not the domain name. Subtext wouldn't (and couldn't) be responsible for the domain name. Right?

  9. Haacked November 30th, 2007

    @Lance - exactly. After all, it's the hoster of the site who has to *own* the domain name and set it up in IIS. Subtext does require you to configure that domain name (since we allow multiple blog hosting in a single blog). The only thing we can do is help you with converting a Unicode host name to Punycode.

  10. wiennat December 2nd, 2007

    I've tried to patch Subtext to support Internationalized URLs. I made it halfway but, unfortunately, I have had a lot of work recently, so I haven't had a chance to finish it yet.
    Feeling ashamed, I am.

  11. Kirit Sælensminde December 4th, 2007

    Phil, great that you're raising awareness about these URL formats. I've been using URLs like that on my site for a couple of years now with mixed results.
    Although the browsers and search engines can handle them properly, very little other software can. I've written in more detail about some of the problems: Internationalised URLs. I suspect this goes a long way towards explaining why the Chinese site uses ASCII URLs.

  12. Wiennat's Blog December 8th, 2007

    Internationalized URL for Subtext