On the maddening subtleties of localizing software

Sean M. Burke and Jordan Lachler's "A Localization Horror Story: It Could Happen To You" is an extremely entertaining and illuminating look at the subtleties inherent in localizing software for different languages. A program that produces only two kinds of output ("I scanned 12 directories" and "Your query matched 10 files in 4 directories") turns out to generate a maddening combinatorial explosion of cases once it has to be translated into even a small number of languages.

So, you email your various translators (the boss decides that the languages du jour are Chinese, Arabic, Russian, and Italian, so you have one translator for each), asking for translations for "I scanned %g directory." and "I scanned %g directories.". When they reply, you'll put that in the lexicons for gettext to use when it localizes your software, so that when the user is running under the "zh" (Chinese) locale, gettext("I scanned %g directory.") will return the appropriate Chinese text, with a "%g" in there where printf can then interpolate $dir_scan.

Your Chinese translator emails right back -- he says both of these phrases translate to the same thing in Chinese, because, in linguistic jargon, Chinese "doesn't have number as a grammatical category" -- whereas English does. That is, English has grammatical rules that refer to "number", i.e., whether something is grammatically singular or plural; and one of these rules is the one that forces nouns to take a plural suffix (generally "s") when in a plural context, as they are when they follow a number other than "one" (including, oddly enough, "zero"). Chinese has no such rules, and so has just the one phrase where English has two. But, no problem, you can have this one Chinese phrase appear as the translation for the two English phrases in the "zh" gettext lexicon for your program.
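To make the lexicon idea concrete, here is a minimal Python sketch of what that "zh" entry might look like. It is not gettext itself: LEXICON, lookup, and scanned_message are hypothetical names, the dictionary stands in for a real message catalog, and the Chinese string is an illustrative placeholder rather than a vetted translation.

```python
# Hypothetical in-memory lexicon standing in for a gettext catalog.
# The Chinese text is an illustrative placeholder, not a vetted translation.
LEXICON = {
    "zh": {
        # Both English phrases map to the same Chinese phrase,
        # because Chinese does not mark grammatical number.
        "I scanned %g directory.": "我扫描了 %g 个目录。",
        "I scanned %g directories.": "我扫描了 %g 个目录。",
    },
}


def lookup(locale, phrase):
    """Return the translation of phrase, or the English phrase if none exists."""
    return LEXICON.get(locale, {}).get(phrase, phrase)


def scanned_message(locale, dir_scan):
    """Pick the English key by count, translate it, then interpolate the number."""
    # English-style rule: singular only for exactly one (zero takes the plural).
    if dir_scan == 1:
        key = "I scanned %g directory."
    else:
        key = "I scanned %g directories."
    return lookup(locale, key) % dir_scan


if __name__ == "__main__":
    print(scanned_message("en", 1))
    print(scanned_message("en", 12))
    print(scanned_message("zh", 12))
```

Note that the choice between singular and plural is still made with an English rule before the lookup happens, which works for Chinese only because both keys lead to the same phrase.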

Emboldened by this, you dive into the second phrase that your software needs to output: "Your query matched 10 files in 4 directories.". You notice that if you want to treat phrases as indivisible, as the gettext manual wisely advises, you need four cases now, instead of two, to cover the permutations of singular and plural on the two items, $dir_count and $file_count.
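The four cases can be written out as a short Python sketch. PHRASES and query_message are hypothetical names, and the selection rule is the English one only; the point is that each phrase is stored whole rather than assembled from fragments, so every added count doubles the number of keys a translator must handle.

```python
# Hypothetical table of whole-phrase keys, indexed by
# (file count is singular, directory count is singular).
PHRASES = {
    (True, True): "Your query matched %g file in %g directory.",
    (True, False): "Your query matched %g file in %g directories.",
    (False, True): "Your query matched %g files in %g directory.",
    (False, False): "Your query matched %g files in %g directories.",
}


def query_message(file_count, dir_count):
    """Select one indivisible English phrase, then interpolate both counts."""
    key = PHRASES[(file_count == 1, dir_count == 1)]
    return key % (file_count, dir_count)


if __name__ == "__main__":
    print(query_message(10, 4))
    print(query_message(1, 1))
    print(len(PHRASES), "phrases for two counted items")
```

With one counted item there are two phrases and with two there are four; under this scheme a message with n counted items would need 2**n English keys, before any other language's rules are considered.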

A Localization Horror Story: It Could Happen To You

(via O'Reilly Radar)

(Image: File:Brueghel-tower-of-babel.jpg, Wikimedia Commons)