TwitterCLDR: Improving Internationalization Support in Ruby

Wednesday, 1 August 2012

We recently open sourced TwitterCLDR under the Apache License 2.0. TwitterCLDR is an “ICU level” internationalization library for Ruby that supports dates, times, numbers, currencies, world languages, sorting, text normalization, time spans, plurals, and Unicode code point data. By sharing our code with the community, we hope to collaborate on improving internationalization support for websites all over the world. If your company is considering supporting multiple languages, try TwitterCLDR to help with your internationalization efforts.

Motivation

Here’s a test. Say this date out loud: 2/1/2012

If you said, “February first, 2012”, you’re probably an American. If you said, “January second, 2012”, you’re probably of European or possibly Asian descent. If you said, “January 12, 1902”, you’re probably a computer. The point is that as humans, we almost never think about formatting dates, plurals, lists, and the like. If you’re creating a platform available around the world, however, these kinds of minutiae make a big difference to users.

The Unicode Consortium publishes and maintains a bunch of data regarding formatting dates, numbers, lists, and more, called the Common Locale Data Repository (CLDR). IBM maintains International Components for Unicode (ICU), a library that makes the Unicode Consortium’s data easier for programmers to use. However, ICU targets Java and C/C++ developers rather than programmers working in Ruby, which is one of the programming languages used at Twitter. For example, Ruby and TwitterCLDR help power our Translation Center. TwitterCLDR provides a way to use the same CLDR data that Java uses, but in a Ruby environment. Hence, formatting dates, times, numbers, currencies, and plurals should now be much easier for the typical Rubyist. Let’s go over some real-world examples.

Example Code

Dates, Numbers, and Currencies

Let’s format a date in Spanish (es):

$> DateTime.now.localize(:es).to_full_s
$> "lunes, 12 de diciembre de 2011 21:44:57 UTC -0800"

Too long? Make it shorter:

$> DateTime.now.localize(:es).to_short_s
$> "12/12/11 21:44" 

Built-in support for relative times lets you do this:

$> (DateTime.now - 1).localize(:en).ago.to_s
$> "1 day ago"
$> (DateTime.now + 1).localize(:en).until.to_s
$> "In 1 day"

Number formatting is easy:

$> 1337.localize(:en).to_s
$> "1,337"
$> 1337.localize(:fr).to_s
$> "1 337"
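
To make the idea of locale-specific digit grouping concrete, here is a minimal plain-Ruby sketch. It is not how TwitterCLDR works internally: the GROUP_SEPARATORS table and the group_digits helper are hypothetical names invented for this illustration, and the separators are hard-coded rather than read from CLDR data (real CLDR data uses a no-break space character for French, not the plain space used here).

```ruby
# Hypothetical, hard-coded separator table for illustration only.
# TwitterCLDR derives this information from CLDR data instead.
GROUP_SEPARATORS = { :en => ",", :fr => " " }.freeze

# Groups the digits of an integer in threes using a single-character
# separator chosen by locale. Raises KeyError for unknown locales.
def group_digits(integer, locale)
  separator = GROUP_SEPARATORS.fetch(locale)
  sign = integer < 0 ? "-" : ""
  digits = integer.abs.to_s
  # Reverse so that groups of three are counted from the right-hand side.
  grouped = digits.reverse.scan(/\d{1,3}/).join(separator).reverse
  sign + grouped
end

puts group_digits(1337, :en)
puts group_digits(1337, :fr)
```

The reverse, scan, reverse approach only works with single-character separators; a real formatter also has to handle decimals, locale-specific group sizes, and non-Latin digits, which is exactly the data CLDR supplies.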

We’ve got you covered for currencies and decimals too:

$> 1337.localize(:es).to_currency.to_s(:currency => "EUR")
$> "1.337,00 €"
$> 1337.localize(:es).to_decimal.to_s(:precision => 3)
$> "1.337,000"

Currency data? Absolutely:

$> TwitterCldr::Shared::Currencies.for_country("Canada")
$> { :currency => "Dollar", :symbol => "$", :code => "CAD" }

Plurals

Get the plural rule for a number:

$> TwitterCldr::Formatters::Plurals::Rules.rule_for(1, :ru)
$> :one
$> TwitterCldr::Formatters::Plurals::Rules.rule_for(3, :ru)
$> :few
$> TwitterCldr::Formatters::Plurals::Rules.rule_for(10, :ru)
$> :many
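
For readers wondering what sits behind those symbols, the following is a hand-written sketch of the CLDR plural categories for Russian, restricted to whole numbers. The method name russian_plural_rule is hypothetical and this is not TwitterCLDR's implementation, which works from the CLDR rule data rather than hard-coded conditions.

```ruby
# Returns the plural category (:one, :few, or :many) for an integer
# in Russian. Fractions, which fall under a separate category, are not handled.
def russian_plural_rule(number)
  n = number.abs
  mod10 = n % 10
  mod100 = n % 100

  if mod10 == 1 && mod100 != 11
    :one    # 1, 21, 31, ... but not 11
  elsif (2..4).cover?(mod10) && !(12..14).cover?(mod100)
    :few    # 2-4, 22-24, ... but not 12-14
  else
    :many   # 0, 5-20, 25-30, ...
  end
end

[1, 3, 10, 11, 21].each do |n|
  puts "#{n}: #{russian_plural_rule(n)}"
end
```

The teen exceptions (11 through 14) are the reason a simple "last digit" check is not enough, and why each locale needs its own rule.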

Embed plurals right in your translatable phrases using JSON syntax:

$> str = 'there %<{ "horse_count": { "one": "is one horse", "other": "are %{horse_count} horses" } }> in the barn'
$> str.localize % { :horse_count => 3 }
$> "there are 3 horses in the barn"

Unicode Data

Get attributes for any Unicode code point:

$> code_point = TwitterCldr::Shared::CodePoint.for_hex("1F3E9")
$> code_point.name
$> "LOVE HOTEL"
$> code_point.category
$> "So"
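
As a rough point of comparison, plain Ruby can convert a hex string into a character and test its general category with a regular expression property, although it has no built-in lookup for character names. The helper names below (char_for_hex, hex_for_char, symbol_other?) are made up for this sketch and are not part of TwitterCLDR.

```ruby
# Builds a one-character UTF-8 string from a hexadecimal code point.
def char_for_hex(hex)
  [hex.to_i(16)].pack("U")
end

# Formats the first character's code point as upper-case hex,
# padded to at least four digits.
def hex_for_char(char)
  format("%04X", char.ord)
end

# True if the string is a single character in the Unicode general
# category "So" (Symbol, other).
def symbol_other?(char)
  !!(char =~ /\A\p{So}\z/)
end

love_hotel = char_for_hex("1F3E9")
puts hex_for_char(love_hotel)
puts symbol_other?(love_hotel)
puts symbol_other?("a")
```

The regular expression can only answer yes-or-no questions about a property; retrieving the name or the full attribute set still requires the Unicode data files, which is the gap the code point API fills.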

Normalize strings using Unicode’s standard algorithms (NFD, NFKD, NFC, or NFKC):

$> "español".localize.code_points
$> ["0065", "0073", "0070", "0061", "00F1", "006F", "006C"]
$> "español".localize.normalize(:using => :NFKD).code_points
$> ["0065", "0073", "0070", "0061", "006E", "0303", "006F", "006C"]
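
The decomposition above can be cross-checked with Ruby's own String#unicode_normalize, which is available in Ruby 2.2 and later. This sketch uses only the standard library, not TwitterCLDR, and the hex_code_points helper is a hypothetical name introduced here.

```ruby
# Lists a string's code points as four-digit (or longer) upper-case hex.
def hex_code_points(string)
  string.codepoints.map { |code_point| format("%04X", code_point) }
end

composed = "espa\u00F1ol"                       # "ñ" as one code point
decomposed = composed.unicode_normalize(:nfkd)  # "n" + combining tilde

puts hex_code_points(composed).inspect
puts hex_code_points(decomposed).inspect

# Recomposing with NFC restores the original string.
puts decomposed.unicode_normalize(:nfc) == composed
```

The two forms render identically but compare as different strings, which is why normalizing before comparing or indexing user input matters.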

Sorting (Collation)

TwitterCLDR includes a pure Ruby, from-scratch implementation of the Unicode Collation Algorithm (with tailoring) that enables locale-aware sorting capabilities.

Alphabetize a list using regular Ruby sort:

$> ["Art", "Wasa", "Älg", "Ved"].sort
$> ["Art", "Ved", "Wasa", "Älg"]

Alphabetize a list using TwitterCLDR’s locale-aware sort:

$> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
$> ["Älg", "Art", "Ved", "Wasa"]
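
To see why the plain sort puts "Älg" last, and what a locale-aware sort has to do differently, here is a deliberately crude approximation in plain Ruby (2.2 or later). It is not the Unicode Collation Algorithm and not what TwitterCLDR does: it simply strips combining marks to build a comparison key, with no tailoring. The names approximate_sort_key and approximate_sort are hypothetical.

```ruby
# Builds a comparison key by decomposing the string, dropping combining
# marks, and downcasing. The original string is kept as a tie-breaker.
def approximate_sort_key(string)
  base = string.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
  [base, string]
end

# Sorts strings by the approximate key above.
def approximate_sort(strings)
  strings.sort_by { |string| approximate_sort_key(string) }
end

words = ["Art", "Wasa", "\u00C4lg", "Ved"]

# Plain sort compares encoded bytes, so "Ä" lands after every ASCII letter.
puts words.sort.inspect

# Treating "Ä" like "A" for comparison moves "Älg" to the front.
puts approximate_sort(words).inspect
```

This shortcut happens to match the German ordering for this list, but it would be wrong for Swedish, where "Ä" sorts after "Z". Handling that kind of per-locale difference is what the tailoring in a real collation implementation provides.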

NOTE: Most of these methods can be customized to your liking.

JavaScript Support

What good is all this internationalization support in Ruby if I can’t expect the same output on the client side too? To bridge the gap between the client and server sides, TwitterCLDR also contains a JavaScript implementation (known as twitter-cldr-js) whose compiled files are maintained in a separate GitHub repo. At the moment, twitter-cldr-js supports dates, times, relative times, and plural rules. We’re working on expanding its capabilities, so stay tuned.

Future Work

In the future, we hope to add even more internationalization capabilities to TwitterCLDR, including Rails integration, phone number and postal code validation, support for Unicode characters in Ruby 1.8 strings and regular expressions, and the ability to translate timezone names via the TZInfo gem and ActiveSupport. We would love to have the community use TwitterCLDR and help us improve the code to reach everyone in the world.

Acknowledgements

TwitterCLDR was primarily authored by Cameron Dutro (@camertron). In addition, we’d like to acknowledge the following folks who contributed to the project either directly or indirectly: Kirill Lashuk (@kl_7), Nico Sallembien (@nsallembien), Sumit Shah (@omnidactyl), Katsuya Noguchi (@kn), Timothy Andrew (@timothyandrew), and Kristian Freeman (@imkmf).


- Chris Aniszczyk, Manager of Open Source (@cra)