Today, we’re excited to open source Clockwork Raven, a web application that allows users to easily submit data to Mechanical Turk for manual review and then analyze that data. Clockwork Raven steps in to do what algorithms cannot: it sends your data analysis tasks to real people and gets fast, cheap and accurate results. We use Clockwork Raven to gather tens of thousands of judgments from Mechanical Turk users every week.
We recently open sourced TwitterCLDR under the Apache License 2.0. TwitterCLDR is an “ICU level” internationalization library for Ruby that supports dates, times, numbers, currencies, world languages, sorting, text normalization, time spans, plurals, and Unicode code point data. By sharing our code with the community, we hope to collaborate on improving internationalization support for websites all over the world.
Over time, Tweets have acquired a language all their own. Some of its conventions have been around a long time (like @username at the beginning of a Tweet) and some are relatively recent (such as lists), but all of them make the language of Tweets unique. Extracting these Tweet-specific components from a Tweet is relatively simple for the majority of Tweets, but as with most text parsing problems, the devil is in the details.
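
To give a feel for what that extraction involves, here is a minimal Python sketch that pulls reply targets, mentions, list references, and hashtags out of a Tweet with regular expressions. It is an illustration only, not the code of any Twitter library: the function name `extract_entities`, the patterns, and the 15-character username limit are all assumptions made for this example, and the patterns handle ASCII text only.

```python
import re

# Assumed, simplified patterns for illustration; real Tweets need far more care.
# A mention is "@name" not preceded by a word character (so emails are skipped),
# optionally followed by "/list-name" to reference a list.
MENTION_RE = re.compile(
    r"(?<![A-Za-z0-9_@])@([A-Za-z0-9_]{1,15})(?:/([A-Za-z][A-Za-z0-9_-]{0,24}))?"
)
# A hashtag must contain at least one letter or underscore, so "#123" is ignored.
HASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_&#])#([A-Za-z0-9_]*[A-Za-z_][A-Za-z0-9_]*)")
# A reply is a Tweet that begins with "@username".
REPLY_RE = re.compile(r"\s*@([A-Za-z0-9_]{1,15})")


def extract_entities(text):
    """Return the Tweet-specific components found in `text` (hypothetical helper)."""
    mentions, lists = [], []
    for match in MENTION_RE.finditer(text):
        if match.group(2):
            # "@user/list" form: record (owner, list name).
            lists.append((match.group(1), match.group(2)))
        else:
            mentions.append(match.group(1))

    reply = REPLY_RE.match(text)
    return {
        "reply_to": reply.group(1) if reply else None,
        "mentions": mentions,
        "lists": lists,
        "hashtags": HASHTAG_RE.findall(text),
    }


tweet = "@jack check out @twitter/team and #opensource, mail me at dev@example.com"
entities = extract_entities(tweet)
```

Even this small sketch has to decide what to do about email addresses, numeric-only hashtags, and list references, which hints at why the details matter: non-Latin scripts, URLs containing `@` or `#`, and punctuation rules are not covered here at all.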