Lost in Translation: Locales, not Languages

December 2nd, 2010

What’s the difference between locales and languages? Wikipedia (as always) has a pretty good definition:

In computing, locale is a set of parameters that defines the user’s language, country and any special variant preferences that the user wants to see in their user interface.

So, language is just one aspect of internationalisation, localisation and globalisation.

For example, The Simpsons is dubbed separately into Brazilian Portuguese AND European Portuguese – even though the language is essentially the same. However, Castilian Spaniards complain that The Simpsons is dubbed into Mexican Spanish, rather than into their native variety.

Another example: Catalan, rather than Castilian (Castellano), is spoken in some regions of Spain, and dialects of Catalan (Mallorquín, Valenciano) are spoken in others. In these cases language is specific not to a country, but to a region.

In the US, people speak of vacations – but in the UK they’re called holidays.

These are all examples of locale versus language.

When making web applications accessible to new audiences, there’s a whole bunch of other stuff to consider – aside from language itself.

Locale is a huge subject – just browse this presentation from the Unicode Common Locale Data Repository.

Here are the main elements of locale we need to think about as software and application developers.

Number format

In Spanish, numbers are punctuated differently than in English, and many other languages have conventions of their own.

For example, in English we’d write 10,569.53. In Spanish, that would be 10.569,53. The comma and full stop have changed meaning.
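
As a rough illustration, here is a minimal Python sketch of that separator swap. The function name and the small SEPARATORS table are made up for this example, and a real application would take the separators from locale data such as CLDR rather than hard-coding them.

```python
# Hypothetical table: (grouping separator, decimal separator) per language.
SEPARATORS = {"en": (",", "."), "es": (".", ",")}


def format_number(value: float, lang: str, decimals: int = 2) -> str:
    """Format a number using the separators assumed for the given language."""
    group_sep, decimal_sep = SEPARATORS[lang]
    # Python's own formatting always produces English-style separators.
    text = f"{value:,.{decimals}f}"
    # Swap via a placeholder so the two replacements don't collide.
    text = text.replace(",", "\x00").replace(".", decimal_sep)
    return text.replace("\x00", group_sep)


print(format_number(10569.53, "en"))  # 10,569.53
print(format_number(10569.53, "es"))  # 10.569,53
```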

Also, pluralisation rules are relevant here. English has a very simple pluralisation structure – one or many.

For example, 1 thing, 2 things, 10 things. If there’s more than one ‘thing’ we simply tack an ‘s’ on the end of the word.

The rules are quite different in Russian, for example. There are three specific rules, depending on whether the number ends in 1, in 2, 3 or 4, or in 5, 6, 7, 8, 9 or 0. There are also two general rules that depend on context within the sentence.

In software and web applications, we often use variables in translated sentences. For example, on Kyero.com we use a string like {number} beds so that we can display any number of beds and only get the actual word translated twice (bed and beds).

This doesn’t work for Russian translations at all. In English we only need to know whether there’s exactly one bedroom (bed) or any other number, including none (beds).

In Russian we’d actually need to architect the variables to create different translations for numbers ending in 1, in 2, 3 or 4, and in 5, 6, 7, 8, 9 or 0 – and it gets even more complicated.

In Russian, pluralisation depends not on the actual quantity, but on the digit the number ends with.

  • Numbers ending in 1: If the number is 1, or the number ends in the word ‘один’ (examples: 1, 21, 61, but not 11)
  • Numbers ending in 2,3,4: If the number, or the last digit of the number, is 2, 3 or 4 (examples: 4, 22, 42, 103, but not 12, 13 or 14)
  • Numbers ending in 5,6,7,8,9,0: All the ‘teens’ fit into this category too (11, 12, 13, 14)
  • General Quantity: Singular or Plural depending on context
  • Quantity not specified: Case appropriate to sentence position
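
The three number-based rules above can be sketched in a few lines of Python. This is only an illustration: the category names (one, few, many) are borrowed from the CLDR plural categories, the function names are invented for this example, and the Russian forms of the word for bedroom are included purely to show the idea. The two context-dependent rules are not covered.

```python
def russian_plural_category(n: int) -> str:
    """Pick the plural category for a whole number, following the list above."""
    n = abs(n)
    last_digit = n % 10
    last_two = n % 100
    if last_digit == 1 and last_two != 11:
        return "one"   # 1, 21, 61 (but not 11)
    if 2 <= last_digit <= 4 and not 12 <= last_two <= 14:
        return "few"   # 4, 22, 42, 103 (but not 12, 13, 14)
    return "many"      # 0, 5-9 and all the teens


# Illustrative translations: one string per category instead of bed/beds.
BEDROOM_FORMS = {"one": "спальня", "few": "спальни", "many": "спален"}


def bedrooms_label(n: int) -> str:
    return f"{n} {BEDROOM_FORMS[russian_plural_category(n)]}"


for count in (1, 3, 5, 11, 21):
    print(bedrooms_label(count))
```

The point of the design is that the translator supplies three strings rather than two, and the application, not the translator, decides which one applies to a given number.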

Finally, currency settings fall into this category of number format too.

In England we’d write €50, but in Spain it would be 50€.

Aside from the position of the currency symbol, the symbol itself also changes depending on the locale.

In the US, the currency symbol is generally just $. An Australian, however, needs to differentiate between Australian dollars and US dollars.
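
Here is a small Python sketch of both points: where the symbol goes, and which symbol to show. Everything in the tables is an assumption made for illustration (including the ‘US$’ label for an Australian reader), and a real application would look these details up in locale data instead.

```python
# Hypothetical data: locales that put the symbol after the amount.
SYMBOL_AFTER_AMOUNT = {"es_ES"}

# Default symbol for each currency code.
DEFAULT_SYMBOLS = {"EUR": "€", "USD": "$", "AUD": "$"}

# Overrides for locales where the bare symbol would be ambiguous.
LOCALE_SYMBOLS = {("en_AU", "USD"): "US$"}


def format_price(amount: int, currency: str, locale: str) -> str:
    """Combine an amount and a currency symbol the way the locale expects."""
    symbol = LOCALE_SYMBOLS.get(
        (locale, currency), DEFAULT_SYMBOLS.get(currency, currency)
    )
    if locale in SYMBOL_AFTER_AMOUNT:
        return f"{amount}{symbol}"
    return f"{symbol}{amount}"


print(format_price(50, "EUR", "en_GB"))  # €50
print(format_price(50, "EUR", "es_ES"))  # 50€
print(format_price(50, "USD", "en_AU"))  # US$50
```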

Clearly, these considerations need to be borne in mind when creating the text strings to be translated. It’s difficult to retro-fit them to an application not originally intended for internationalisation.

Case conversion

This covers a multitude of rules including capitalisation, gender and tense.

The German language, for example, has different capitalisation rules to English.

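Spanish houses would translate to spanische Häuser: the noun is capitalised, but the adjective is not.

However, depending on the position and role of the word Spanish within the sentence, it may or may not be capitalised in German (at the start of a sentence, for instance, or as the name of the language, Spanisch). Context is very important here.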

In English we often use a noun or a proper noun as an adjective: Barcelona apartments.

In this case, the proper noun Barcelona is used as an adjective to modify another noun – apartments.

In French, nouns aren’t used as adjectives – so the actual structure of that string needs to be completely changed in French to appartements à Barcelone.

In our application, that’s likely to be stored as variables such as:
{property type} in {location}

The variables would be dynamically replaced by actual property types and locations depending on the web page being served.

In French, the translation changes depending on the property type AND the location type:

For example, appartements à Barcelone, but appartements en Espagne.

We can further complicate matters by adding another adjective:
New apartments in Barcelona

This would translate to nouveaux appartements à Barcelone and nouvelles villas à Barcelone – because of a gender difference in the two types of property in French.

Again, providing the translator with a single string of variables to translate such as:
{new} {property type} in {location} just won’t work in many languages.

Every string and variable needs to be originally constructed with a knowledge of the intended target locales.
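
To make that concrete, here is a hedged Python sketch of what the French side might need. The table and function names are invented for this example and the data only covers the phrases used above. The point is that each variable has to carry extra information (gender for the property type, preposition for the location) before a correct phrase can be assembled.

```python
# Hypothetical French data: each property type carries its gender,
# and each location carries the preposition it needs.
PROPERTY_TYPES_FR = {
    "apartments": ("appartements", "m"),
    "villas": ("villas", "f"),
}
LOCATIONS_FR = {
    "Barcelona": ("Barcelone", "à"),   # city
    "Spain": ("Espagne", "en"),        # country
}
NEW_FR = {"m": "nouveaux", "f": "nouvelles"}


def french_heading(property_type: str, location: str, new: bool = False) -> str:
    """Build '{new} {property type} in {location}' for French."""
    noun, gender = PROPERTY_TYPES_FR[property_type]
    place, preposition = LOCATIONS_FR[location]
    words = [NEW_FR[gender]] if new else []
    words += [noun, preposition, place]
    return " ".join(words)


print(french_heading("apartments", "Barcelona", new=True))
print(french_heading("villas", "Barcelona", new=True))
print(french_heading("apartments", "Spain"))
```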

Date and time formatting

You might be surprised how many ways date and time can be formatted – even for one locale. Of course, the number of options multiplies greatly when each original date/time format needs to be translated into several other locales.

The calendar in use might be Buddhist, Gregorian, Islamic, Japanese or Traditional Chinese.

The actual date format might be virtually any combination of day, month, year, era, hours, minutes and seconds. Each of those could be represented by digits, letters, names or abbreviated names.

That’s well over 100 possible combinations of date and time formatting to consider – without even taking the different positioning options into consideration.
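
Even the simplest all-numeric date shows the problem. The Python sketch below uses a small hand-written table of patterns, which is an assumption for illustration only (a real application would take its patterns from CLDR), and formats this post’s date for three locales.

```python
from datetime import date

# Hypothetical short-date patterns, one per locale.
SHORT_DATE_PATTERNS = {
    "en_GB": "%d/%m/%Y",  # day first
    "en_US": "%m/%d/%Y",  # month first
    "de_DE": "%d.%m.%Y",  # day first, dots as separators
}


def format_short_date(value: date, locale: str) -> str:
    """Format a date as an all-numeric short date for the given locale."""
    return value.strftime(SHORT_DATE_PATTERNS[locale])


posted = date(2010, 12, 2)
for locale in SHORT_DATE_PATTERNS:
    print(locale, format_short_date(posted, locale))
```

Note that the UK and US results contain the same digits in a different order, so a reader who assumes the wrong locale will see a different date, not an obvious error.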

Solving this kind of complex but standard translation puzzle for software developers is one of the aims of the CLDR project. Its library is continually updated by contributors – and is available to download for free.

String Collation

This is about how text is sorted. For example, a list of Spanish locations in English would sort The Balearic Islands under B for Balearic, rather than T for The.

In Castilian Spanish, this location would translate to Las Islas Baleares, and would be sorted under either I or B, but definitely not L.
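
One simple way to get this behaviour is to sort on a key that skips the leading article. The Python sketch below assumes a small, hand-picked list of articles per language (invented for this example) and files the Spanish name under I; filing it under B would need a further rule.

```python
# Hypothetical list of leading articles to ignore when sorting.
LEADING_ARTICLES = {
    "en": ("the ",),
    "es": ("las ", "los ", "la ", "el "),
}


def sort_key(name: str, lang: str) -> str:
    """Return the name lower-cased, without any leading article."""
    lowered = name.casefold()
    for article in LEADING_ARTICLES.get(lang, ()):
        if lowered.startswith(article):
            return lowered[len(article):]
    return lowered


places = ["The Balearic Islands", "Catalonia", "Andalusia"]
print(sorted(places, key=lambda name: sort_key(name, "en")))
print(sort_key("Las Islas Baleares", "es"))
```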

What sort order would be used for the following words?
cina – Cina – çina – Çina

What if there’s a mixture of numbers and letters in a list? What if the list also contains Roman numerals? How should we deal with the Slovak ch, which naturally sorts after H?
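
The Slovak case can be sketched in Python by treating ch as a single letter with its own place in the alphabet. This is a deliberate simplification: the alphabet below leaves out every Slovak letter with a diacritic, and the names are invented for this example.

```python
# Simplified alphabet (no diacritics), with "ch" placed after "h".
ALPHABET = [
    "a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k", "l", "m",
    "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}


def slovak_key(word: str) -> list[int]:
    """Turn a word into a list of letter ranks, reading 'ch' as one letter."""
    lowered = word.casefold()
    key = []
    i = 0
    while i < len(lowered):
        if lowered.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            # Unknown characters sort after every known letter.
            key.append(RANK.get(lowered[i], len(ALPHABET)))
            i += 1
    return key


words = ["chata", "cesta", "ihla", "hora", "dom"]
print(sorted(words))                  # plain code-point order
print(sorted(words, key=slovak_key))  # 'chata' now comes after 'hora'
```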

Again, the CLDR project aims to help software developers by providing a standard library of each of these cases. However, developers must design software applications with these requirements in mind.

Scripts

English is written in Latin script, which defines both the alphabet and the direction of writing: left to right.

Modern standard Arabic uses a right-to-left Arabic script, and standard Hindi uses a left-to-right Devanagari script.

Other languages written in non-Latin scripts include Chinese (Simplified), Japanese, Hebrew, Greek and Russian.

Unfortunately, not every locale has simple script and direction rules. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts.

One further complication – content can also be predominantly written in one language – but quote another.

In Arabic and Hebrew text the content flows predominantly from right to left, but embedded numbers or text in other scripts (such as Latin) would still run left to right.

Text in other languages, such as English, can also be bidirectional if it includes excerpts from languages such as Arabic and Hebrew.

So, as well as choosing the correct script for the locale, developers also need to select the correct direction of writing and take care when mixing content in different languages on the same page.
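
As a starting point for choosing the direction, here is a Python sketch of a common heuristic: take the direction of the first strongly directional character in the text. It relies on the standard unicodedata module, the function name is invented for this example, and it only guesses the overall direction of a piece of text. Laying out mixed-direction content properly is a much bigger job.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess 'ltr' or 'rtl' from the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):  # Hebrew-style or Arabic-style letters
            return "rtl"
    # Digits, spaces and punctuation alone don't decide anything.
    return "ltr"


print(base_direction("Barcelona apartments"))  # ltr
print(base_direction("שלום"))                   # rtl
print(base_direction("123 مرحبا"))              # rtl
```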

What’s more, the fundamental design of an application needs to change for certain locales and scripts – more on this in another post.

Other stuff

  • Units of measurement (speed, distance, weight etc.) all change depending on locale
  • Paper size and printing defaults also change
  • Terminology and abbreviations referring to specific world time zones change
  • Numeric abbreviations such as ‘M for million’ or ‘K for thousand’ change

Summary

Locale encompasses so much more than language.

When trying to make a web application reach a new audience, we need to know what language they speak – and what variant of that language too.

We also need to know where they live and how they expect to see numbers, dates, currency symbols – and all of the other differences covered by locale.

Next, I’ll be looking at the issues of how best to handle the hidden text in a web application.

Martin Dell, Kyero.com
