In this post, we discuss how we have evolved our search technology to accommodate diverse document types, the surprising performance impact of these changes, and how we are using this improved technology to power Twitter’s latest product efforts.
Our search infrastructure team is building Omnisearch, a new information retrieval system to power the next generation of relevance-based, personalized products. We recently launched SuperRoot, the first major architectural component of Omnisearch. In this blog post, we detail the path to building and launching a high-SLA production system at Twitter.
Finagle is our fault-tolerant, protocol-agnostic RPC framework built atop Netty. Twitter’s core services are built on Finagle, from backends serving user profile information, Tweets, and timelines to front-end API endpoints handling HTTP requests.
Columnar storage is a popular technique for optimizing analytical workloads in parallel RDBMSs. Its performance and compression benefits for storing and processing large amounts of data are well documented in the academic literature and in several commercial analytical databases.
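To make the compression benefit concrete, here is a minimal sketch in Python. It is not taken from any particular database: the sample rows, `to_columns`, and `run_length_encode` are all made up for illustration. It pivots row-oriented records into columns, then run-length encodes one column whose values repeat.

```python
from itertools import groupby

# Hypothetical row-oriented records: (date, country, count).
rows = [
    ("2013-07-01", "US", 3),
    ("2013-07-01", "US", 5),
    ("2013-07-01", "JP", 2),
    ("2013-07-02", "JP", 7),
]


def to_columns(records):
    """Pivot a list of equal-length row tuples into one list per column."""
    return [list(column) for column in zip(*records)]


def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


dates, countries, counts = to_columns(rows)

# An aggregate over one field only has to touch that field's column.
total = sum(counts)

# Adjacent repeated values in a column compress well.
encoded_dates = run_length_encode(dates)
encoded_countries = run_length_encode(countries)
```

Because values of the same type and similar content sit next to each other, a scan reads only the columns a query needs, and simple encodings such as this one shrink repetitive columns considerably.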
Today we’re open-sourcing the Hosebird Client (hbc) under the ALv2 license to provide a robust Java HTTP library for consuming Twitter’s Streaming API. The client is full-featured: it offers support for GZip, OAuth, and partitioning; automatic reconnections with appropriate backfill counts; access to the raw bytes payload; proper retry schemes; and relevant statistics.
We are a heavy adopter of Apache Hadoop, with a large set of data residing in its clusters, so it’s important for us to understand how these resources are utilized. At our July Hack Week, we experimented with developing HDFS-DU to give us an interactive visualization of the underlying Hadoop Distributed File System (HDFS).
Over time, Tweets have acquired a language all their own. Some of its conventions have been around a long time (like @username at the beginning of a Tweet) and some are relatively recent (such as lists), but all of them make the language of Tweets unique. Extracting these Tweet-specific components from a Tweet is relatively simple for the majority of Tweets, but as with most text parsing problems, the devil is in the details.
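As a rough illustration of the simple case, the sketch below pulls @username mentions, list references, and a leading reply target out of a Tweet with one regular expression. The pattern, the length limits, and the `extract_entities` helper are simplifying assumptions made for this example, not the rules any official parser uses.

```python
import re

# Assumed, simplified grammar: "@" + up to 15 word characters, optionally
# followed by "/list-slug". The lookbehind skips e-mail-like text (foo@bar).
# Known gap: an over-long name is truncated to 15 characters, not rejected.
ENTITY_RE = re.compile(
    r"(?<![A-Za-z0-9_@])@([A-Za-z0-9_]{1,15})(/[A-Za-z][A-Za-z0-9_-]{0,24})?"
)


def extract_entities(text):
    """Return mentions, list references, and the reply target found in text."""
    mentions, lists = [], []
    for match in ENTITY_RE.finditer(text):
        user, slug = match.group(1), match.group(2)
        if slug:
            lists.append((user, slug[1:]))  # drop the leading "/"
        else:
            mentions.append(user)

    # Treat a plain @username at the very start of the Tweet as a reply.
    first = ENTITY_RE.match(text)
    reply_to = first.group(1) if first and not first.group(2) else None
    return {"mentions": mentions, "lists": lists, "reply_to": reply_to}


entities = extract_entities(
    "@alice check @bob/cool-list, mail me at foo@example.com"
)
```

Even this small pattern needs a guard so that the e-mail address is not read as a mention, which hints at how quickly the edge cases pile up in real Tweets.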