A while back we announced on our API developers list that we would change the way we generate unique ID numbers for tweets.
While we’re not quite ready to make this change, we’ve been hard at work on Snowflake which is the internal service to generate these ids. To give everyone a chance to familiarize themselves with the techniques we’re employing and how it’ll affect anyone building on top of the Twitter platform we are open sourcing the Snowflake code base today.
Before I go further, let me provide some context.
We currently use MySQL to store most of our online data. In the beginning, the data was in one small database instance which in turn became one large database instance and eventually many large database clusters. For various reasons, the details of which merit a whole blog post, we’re working to replace many of these systems with the Cassandra distributed database or horizontally sharded MySQL (using gizzard).
Unlike MySQL, Cassandra has no built-in way of generating unique ids – nor should it, since at the scale where Cassandra becomes interesting, it would be difficult to provide a one-size-fits-all solution for ids. Same goes for sharded MySQL.
Our requirements for this system were pretty simple, yet demanding:
We needed something that could generate tens of thousands of ids per second in a highly available manner. This naturally led us to choose an uncoordinated approach.
These ids need to be roughly sortable, meaning that if tweets A and B are posted around the same time, they should have ids in close proximity to one another since this is how we and most Twitter clients sort tweets.
Additionally, these numbers have to fit into 64 bits. We’ve been through the painful process of growing the number of bits used to store tweet ids before. It’s unsurprisingly hard to do when you have over 100,000 different codebases involved.
We considered a number of approaches: MySQL-based ticket servers (like flickr uses), but those didn’t give us the ordering guarantees we needed without building some sort of re-syncing routine. We also considered various UUIDs, but all the schemes we could find required 128 bits. After that we looked at Zookeeper sequential nodes, but were unable to get the performance characteristics we needed and we feared that the coordinated approach would lower our availability for no real payoff.
To generate the roughly-sorted 64 bit ids in an uncoordinated manner, we settled on a composition of: timestamp, worker number and sequence number.
Sequence numbers are per-thread and worker numbers are chosen at startup via zookeeper (though that’s overridable via a config file).
We encourage you to peruse and play with the code: you’ll find it on github. Please remember, however, that it is currently alpha-quality software that we aren’t yet running in production and is very likely to change.
If you find bugs, please report them on github. If you are having trouble understanding something, come ask in the #twinfra IRC channel on freenode. If you find anything that you think may be a security problem, please email email@example.com (and cc myself: firstname.lastname@example.org).
 In mathematical terms, although the tweets will no longer be sorted, they will be k-sorted. We’re aiming to keep our k below 1 second, meaning that tweets posted within a second of one another will be within a second of one another in the id space too.