What is the Common Locale Data Repository?
November 14th, 2011
If you’ve ever been involved in software or website localisation, you have probably run into questions that you just couldn’t answer, such as:
- What do Finnish people call their own language? or
- What’s the correct language code to use for a website translated into Simplified Chinese?
The CLDR has these answers and more.
CLDR is one of the Unicode Consortium’s projects. In a nutshell:
The Unicode Consortium is a non-profit organization devoted to developing, maintaining, and promoting software internationalization standards and data, particularly the Unicode Standard, which specifies the representation of text in all modern software products and standards.
First off, if you’ve ever tried to represent accented or non-Latin characters on a website using anything other than the UTF-8 character encoding, you’ll know that it’s almost impossible to get right.
Unicode makes simple work of making sure that Chinese, Japanese, Hebrew, Arabic, Hindi and Urdu characters get faithfully represented on the web page or in the software application.
It’s not just non-Latin scripts, either: many languages written in the Latin alphabet also have their own accents and special characters, and Unicode solves pretty much every one of the problems that cause them to get mangled.
Alongside the character standard, the Unicode Consortium maintains CLDR, a database of EVERYTHING software developers need to know about a locale (a language as used in a specific place, such as Swiss German or Belgian Dutch, also known as Flemish). This data includes:
- Date/time formats
- Number/currency formats
- Measurement Units
- Sorting, Searching, Matching
- Names for Languages, Territories, Scripts, Timezones, Currencies
- Characters used by a language
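Each of these data sets is keyed by a locale identifier, which combines a language subtag with optional script and region subtags. As a rough illustration, the sketch below splits a simple identifier into those parts. The `Locale` type and `parse_locale` function are hypothetical names invented for this example, and only the basic language-script-region shape is handled; it does not validate subtags against CLDR.

```python
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    """Hypothetical container for the three most common locale subtags."""
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_locale(tag: str) -> Locale:
    """Split a tag such as "de-CH" or "zh_Hans_CN" into its subtags.

    Only the language[-Script][-REGION] shape is handled; variants and
    extensions are out of scope for this sketch.
    """
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            # Script subtags are four letters, written in title case (Hans).
            script = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            # Region subtags are two letters (CH) or three digits (419).
            region = part.upper()
        else:
            raise ValueError(f"unsupported subtag: {part!r}")
    return Locale(language, script, region)


print(parse_locale("de-CH"))       # German as used in Switzerland
print(parse_locale("zh_Hans_CN"))  # Chinese, Simplified script, China
```

Keeping language, script and region separate makes it harder to confuse a language code such as zh with a country code such as CN.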
Currently, CLDR contains information for over 500 locales – and it’s growing. The Unicode Consortium freely publishes the data in a standardised XML format so that software applications and websites can programmatically remain in sync with the latest consensus on correct locale usage.
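The XML format is called LDML (Locale Data Markup Language). To give a feel for how an application might read it, here is a minimal sketch using Python's standard library. The XML fragment is hand-written for this example and only modelled on the layout of a locale file; it was not copied from a CLDR release, and `language_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hand-written fragment modelled on the layout of a CLDR locale file.
# Illustrative only: real files are far larger.
SAMPLE_LDML = """\
<ldml>
  <identity>
    <language type="fi"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="el">kreikka</language>
      <language type="fi">suomi</language>
    </languages>
  </localeDisplayNames>
</ldml>
"""


def language_names(ldml_text: str) -> Dict[str, str]:
    """Map language codes to the display names found in an LDML document."""
    root = ET.fromstring(ldml_text)
    names = {}
    for node in root.findall("./localeDisplayNames/languages/language"):
        names[node.get("type")] = node.text
    return names


names = language_names(SAMPLE_LDML)
print(names["fi"])  # suomi
```

In practice most developers would not parse these files by hand but would rely on a library that bundles the CLDR data.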
Importantly, CLDR data is arrived at by consensus and is ‘owned’ by the community. If you think that Ελληνικά, the standard name for the Greek language, is inaccurate, you can propose an alternative. If enough people agree with you, your change will become the new standard.
Getting details like these right the first time matters: accurately localising your application or website is an essential part of reaching a broader audience, and it reduces the friction between the original-language version and the translated version.
CLDR saves software developers and business owners from having to reinvent the wheel, and from translating the same standard content over and over again.
Oh, and Finnish speakers call their language suomi, and the language code for Chinese is zh, with zh-Hans identifying Simplified Chinese (the country code for China is CN).
Thanks to CLDR, getting our country codes and language links correct for the 20+ languages we’re adding to Kyero.com has been greatly simplified.
Current & Future Locales for Kyero.com
(Locales sorted according to the Unicode Collation Algorithm – Latin scripts first A-Z, followed by Greek, Cyrillic, Hebrew, Arabic, Devanagari, Japanese and Chinese.) Script direction is left-to-right (LTR) or right-to-left (RTL).
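As a rough way to check script direction for a locale's native name, the sketch below looks at the first strongly directional character in a string, using the bidirectional classes in Python's `unicodedata` module. This is a simple heuristic chosen for illustration, not the way CLDR records direction, and `text_direction` is a hypothetical helper name.

```python
import unicodedata


def text_direction(text: str) -> str:
    """Return "RTL" or "LTR" based on the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            # R covers Hebrew letters, AL covers Arabic letters.
            return "RTL"
        if bidi == "L":
            return "LTR"
    # No strong character found (e.g. digits only): assume left-to-right.
    return "LTR"


for name in ["suomi", "Ελληνικά", "עברית", "العربية", "日本語"]:
    print(name, text_direction(name))
```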