Introducing the Open Source Twitter Text libraries

Thursday, 4 February 2010

Over time, Tweets have acquired a language all their own. Some of its conventions have been around a long time (like @username at the beginning of a Tweet) and some are relatively recent (such as lists), but all of them make the language of Tweets unique. Extracting these Tweet-specific components is relatively simple for the majority of Tweets but, as with most text parsing problems, the devil is in the details.

We’ve extracted the code we use to handle Tweet-specific elements and released it as an open source library. This first version is available in Ruby and Java, but in the Twitter spirit of openness we’ve also released a conformance test suite so that other implementations can verify they meet the same standards.

Tweet-specific Language

It all started with the @reply … and then it got complicated. Twitter users started using @username at the beginning of a Tweet to indicate a reply, but you’re not here to read about history. To talk about the new Twitter Text libraries, we first need to define the Tweet-specific elements we’re interested in. Much of this will be a review of what you already know, but a shared vocabulary will help later on. While the Tweet-specific language is always expanding, the current elements are:

  • @reply This is a Tweet which begins with @username. This is distinct from the presence of @username elsewhere in the Tweet (more on that in a moment). An @reply Tweet is considered directly addressed to the @username, and only some of your followers will see it (notably, those who follow both you and the @username).
  • @mention This is a Tweet which contains one or more @usernames anywhere in the Tweet. Technically an @reply is a type of @mention, which is important from a parsing perspective. An @mention Tweet will be delivered to all of your followers, regardless of whether they follow the @mentioned user.
  • @username/list-name Twitter lists are referenced using the syntax @username/list-name where the list-name portion has to meet some specific rules.
  • #hashtag As long as there has been a way to search Tweets, people have been adding information to make them easy to find. The #hashtag syntax has become the standard for attaching a succinct tag to Tweets.
  • URLs While URLs are not Tweet-specific, they are an important part of Tweets and require some special handling. There is a vast array of services based on the URLs in Tweets. In addition to services that extract the URLs, most people expect URLs to be automatically converted to links when viewing a Tweet.
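To make the vocabulary above concrete, here is a rough Ruby sketch that pulls each element out of a Tweet using deliberately simplified regular expressions. The patterns, the 20-character username limit, the list-name rule and the extract_elements helper are all assumptions made for illustration; they are not the rules the Twitter Text libraries actually implement.

```ruby
# Simplified, hypothetical patterns. The real libraries cover far more edge cases.
# "@name" optionally followed by "/list-name"; must not follow a word character.
MENTION_OR_LIST = /(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,20})(\/[A-Za-z][A-Za-z0-9_-]{0,24})?/
# "#tag" containing at least one letter or underscore.
HASHTAG = /(?<![A-Za-z0-9_&])#([A-Za-z0-9_]*[A-Za-z_][A-Za-z0-9_]*)/
# Very lenient URL pattern: everything up to the next whitespace.
URL = %r{https?://[^\s]+}
# An @reply is an @username at the very beginning of the Tweet.
REPLY = /\A\s*@([A-Za-z0-9_]{1,20})/

def extract_elements(tweet)
  mentions = []
  lists = []
  tweet.scan(MENTION_OR_LIST) do |username, list_slug|
    if list_slug
      lists << "#{username}#{list_slug}"
    else
      mentions << username
    end
  end

  {
    reply_to: tweet[REPLY, 1],            # nil when the Tweet is not an @reply
    mentions: mentions,                   # includes the @reply target, if any
    lists:    lists,
    hashtags: tweet.scan(HASHTAG).flatten,
    urls:     tweet.scan(URL)
  }
end

p extract_elements("@jack check @twitter/team and @biz #ruby http://example.com/a")
```

Note that the @reply target also shows up in the list of mentions, which matches the idea that an @reply is a type of @mention.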

Twitter Text Libraries

For this first version of the Twitter Text libraries we’ve released both Ruby and Java versions. We certainly expect more languages in the future and we’re looking forward to the patches and feedback we’ll get on these first versions.

For each library we’ve provided functions for extracting the various Tweet-specific elements. Displaying Tweets in HTML is a very common use case so we’ve also included HTML auto-linking functions. The individual language interfaces differ so they can feel as natural as possible for each individual language.
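To give a feel for what auto-linking involves, here is a minimal Ruby sketch that escapes a Tweet for HTML and then wraps URLs, @usernames and #hashtags in anchor tags in a single pass. The auto_link_sketch helper, the simplified patterns and the link targets are assumptions for illustration only, and the libraries' real auto-linking functions are considerably more careful.

```ruby
require 'cgi'

# Simplified, hypothetical patterns (not the ones used by the libraries).
URL_RE     = %r{(?<url>https?://[^\s<]+)}
MENTION_RE = /(?<![A-Za-z0-9_])@(?<user>[A-Za-z0-9_]{1,20})/
HASHTAG_RE = /(?<![A-Za-z0-9_&])#(?<tag>[A-Za-z0-9_]*[A-Za-z_][A-Za-z0-9_]*)/
# One combined pattern, so text inside an inserted link is never re-linked.
LINKABLE   = Regexp.union(URL_RE, MENTION_RE, HASHTAG_RE)

def auto_link_sketch(tweet)
  # Escape first so user-supplied markup cannot leak into the page.
  CGI.escapeHTML(tweet).gsub(LINKABLE) do
    m = Regexp.last_match
    if m[:url]
      %(<a href="#{m[:url]}">#{m[:url]}</a>)
    elsif m[:user]
      # Assumed profile URL layout.
      %(@<a href="https://twitter.com/#{m[:user]}">#{m[:user]}</a>)
    else
      # Assumed search URL layout; %23 is the URL-encoded "#".
      %(<a href="https://twitter.com/search?q=%23#{m[:tag]}">##{m[:tag]}</a>)
    end
  end
end

puts auto_link_sketch("Hi @jack, see #ruby at http://example.com <b>")
```

Matching everything with one combined pattern avoids a classic bug in multi-pass linkers, where a later pass finds a "#" or "@" inside markup that an earlier pass inserted.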

Ruby Library

The Ruby library is available as a gem via gemcutter or the source code can be found on github. You can also peruse the rdoc hosted on github. The Ruby library is provided as a set of Ruby modules so they can be included in your own classes and modules. The rdoc is a more complete reference but for a quick taste check out this short example:

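require 'twitter-text'
class MyClass
  include Twitter::Extractor
  # include adds instance methods, so call the extractor from one
  def usernames_in(text)
    extract_mentioned_screen_names(text)
  end
end
usernames = MyClass.new.usernames_in("Mentioning @twitter and @jack")
# usernames = ["twitter", "jack"]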

The interface makes this all seem quite simple, but there are some very complicated edge cases. I’ll talk more about that in the next section, Conformance Testing.

Java Library

The source code for the Java library can be found on github. The library provides an ant file for building the twitter-text.jar file. You can also peruse the javadocs hosted on github. The Java library provides Extractor and Autolink classes that provide object-oriented methods for extraction and auto-linking. The javadoc is a more complete reference but for a quick taste check out this short example:

import java.util.List;
import com.twitter.Extractor;

public class Check {
  public static void main(String[] args) {
    List<String> names;
    Extractor extractor = new Extractor();

    names = extractor.extractMentionedScreennames("Mentioning @twitter and @jack");
    for (String name : names) {
      System.out.println("Mentioned @" + name);
    }
  }
}

The library makes this all seem quite simple, but there are some very complicated edge cases.

Conformance Testing

While working on the Ruby and Java versions of the Twitter Text libraries, it became pretty clear that porting tests to each language individually wasn’t going to be sustainable. To help keep things in sync we created the Twitter Text Conformance project. This project provides some simple YAML files that define the expected before and after states for testing. The per-language implementation of these tests can vary along with the per-language interface, so the tests feel intuitive to programmers in each language.
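As a rough illustration of the idea, the Ruby sketch below loads a small YAML document of test cases and runs each one against an extractor. The YAML layout shown here (a tests key, a section name, and description, text and expected fields), the naive_mentions extractor and the run_conformance helper are assumptions for the sake of the example; check the conformance project itself for the real file format.

```ruby
require 'yaml'

# Hypothetical conformance data: an input Tweet and the expected extraction.
SAMPLE_SUITE = <<~YAML
  tests:
    mentions:
      - description: "Extract a mention at the start of a Tweet"
        text: "@username reply"
        expected: ["username"]
      - description: "Extract two mentions"
        text: "Mentioning @twitter and @jack"
        expected: ["twitter", "jack"]
YAML

# Stand-in extractor; a real port would call its own library here.
def naive_mentions(text)
  text.scan(/(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,20})/).flatten
end

# Runs every case in one section through the block and records pass/fail.
def run_conformance(yaml_text, section)
  cases = YAML.safe_load(yaml_text).fetch("tests").fetch(section)
  cases.map do |test_case|
    actual = yield(test_case["text"])
    { description: test_case["description"], passed: actual == test_case["expected"] }
  end
end

results = run_conformance(SAMPLE_SUITE, "mentions") { |text| naive_mentions(text) }
results.each { |r| puts "#{r[:passed] ? 'PASS' : 'FAIL'}: #{r[:description]}" }
```

Because the expectations live in data rather than in code, an implementation in another language only needs its own small runner like this one to check itself against the same cases.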

The basic extraction and auto-link test cases are easy to understand, but the edge cases abound. Many of the largest complications come from handling Tweets written in Japanese and other languages that don’t use spaces. We also try to be lenient with the allowed URL characters, which creates some more headaches.

@mzsanford