Overview of the Twitter Cloud Platform: Compute

Thursday, 28 April 2016

On March 31, we hosted the first ever #compute event at our headquarters in San Francisco. Over 150 people attended the event and heard our engineers talk about the present and future of how we build and operate the compute infrastructure as part of the Twitter Cloud Platform. We also gained insights from many in the community on the challenges they face and how they are addressing similar problems. Overall, the #compute event served as a great forum for all of us to connect, learn, and share ideas.

With that, we are excited to announce the availability of all the technical talks from the event on the Twitter University YouTube channel.

About the Twitter Cloud Platform: Compute
Twitter Cloud Platform: Compute powers over 95% of all stateless services at Twitter. It is built atop open source technologies, including Apache Mesos and Apache Aurora, and a suite of internal services that address both operator and user needs. The platform has grown from managing a few hundred containers to over 100,000 containers across tens of thousands of hosts. Not only has it become a critical piece of Twitter’s underlying infrastructure, but our experience in building and operating it is also contributing to the future of cloud infrastructure management in the industry.

Our vision for the Twitter Cloud Platform: Compute is built around the following tenets:

Reliability: provide an always-available system with strong resource and performance guarantees
Developer agility: make it simple for developers to build, deploy, manage, and scale their services
Efficiency: run a well-utilized and cost-effective platform
Scalability: accommodate growing business needs without sacrificing reliability, developer agility, or efficiency

Given Twitter’s growing and diverse computing needs, we are building the next-generation compute infrastructure that leverages both private and public clouds in a way that is reliable, scalable, developer-friendly, and cost-effective.

The #compute event

https://twitter.com/TwitterEng/status/710609684890189825

We had six technical talks, each focused on specific challenges, from the bottom up, in building and operating the Twitter Cloud Platform: Compute. The first two talks, by Eric Danielson and David Robinson, are operator-focused and dive deep into how we provision and manage servers and how a small team of SREs performs operational procedures on a large-scale compute cluster. These serve as the building blocks of the platform and have an impact on two of our core tenets: reliability and scalability.

Server provisioning and management at scale
Eric Danielson, tech lead of provisioning engineering, talks about two specific systems, Audubon (a machine database) and Wilson API/Lifecycle Manager (a machine lifecycle manager), that are used to provision, track, and manage a fleet of tens of thousands of hosts at Twitter.

Managing a large scale compute platform
David Robinson, site reliability engineer on Compute, talks about how the team leverages Audubon, Wilson, and other tooling to manage configuration, deployment, and other operational procedures across tens of thousands of hosts in the Compute cluster. He also shares the challenges of operating at this scale and the things that can go wrong and affect the reliability of the cluster.

Chargeback for multi-tenant infrastructure systems
Software engineers on Cloud Infrastructure Management, Vinu Charanya and Jessica Yuen, talk about how they built a generic system that helps define chargeable resources, collects utilization metrics, resolves owners, and generates billing and utilization reports to improve resource utilization and cost effectiveness of every multi-tenant infrastructure service at Twitter.
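
To make the four steps above concrete (define chargeable resources, collect utilization, resolve owners, generate a report), here is a minimal sketch in Python. It is only an illustration and not the system presented in the talk: the resource names, rates, service names, team names, and the build_report function are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical catalog of chargeable resources: resource name -> price per unit.
RATES = {"cpu_core_day": 2.0, "ram_gb_day": 0.5}

# Hypothetical ownership data: service name -> owning team.
OWNERS = {"timeline-api": "timelines", "search-indexer": "search"}


@dataclass(frozen=True)
class Usage:
    """One collected utilization record for a service."""
    service: str
    resource: str
    quantity: float


def build_report(usages, rates=RATES, owners=OWNERS):
    """Resolve each record to an owner and total the cost per owner."""
    totals = defaultdict(float)
    for usage in usages:
        # Records with no known owner are grouped so they remain visible.
        team = owners.get(usage.service, "unowned")
        totals[team] += usage.quantity * rates[usage.resource]
    return dict(totals)


usages = [
    Usage("timeline-api", "cpu_core_day", 10),
    Usage("timeline-api", "ram_gb_day", 40),
    Usage("search-indexer", "cpu_core_day", 4),
    Usage("orphan-job", "ram_gb_day", 8),
]
report = build_report(usages)
for team, cost in sorted(report.items()):
    print(f"{team}: {cost:.2f}")
```

Grouping unresolved usage under a separate "unowned" bucket is one possible design choice; it keeps unattributed cost visible in the report instead of silently dropping it.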

Aurora Workflows
One of the biggest cognitive overheads for a developer is shepherding a deployment from development to production. Software engineer David Mclaughlin from the Compute team talks about our internal project called “Workflows,” which aims to reduce this overhead and become a key building block for continuous deployment automation in the future.

Twitter Heron on the Aurora
Heron is the next-generation real-time streaming analytics platform used heavily at Twitter. Heron runs diverse real-time analytics applications ranging from counting to real-time machine learning. Heron tech lead Maosong Fu talks about how Heron leverages Aurora extensively to run its topologies.

AWS + Aurora/Mesos
In 2014, TellApart (now part of Twitter) adopted Mesos/Aurora as the solution to run their infrastructure in AWS. Engineering Manager David Hagar shares the team’s experiences, the problems encountered, and how they addressed them. He also gives a glimpse into the work to extend the capabilities of the Compute infrastructure across private and public clouds.

Acknowledgements

We’d like to thank vice president of platform engineering Chris Pinkham, all the speakers, attendees, and folks who worked behind the scenes to make the #compute event possible. A special note of acknowledgement to Megan Carlisle, Holly Dyche, J.J. Jeyappragash, Ian Brown, Ian Downes, Derek Lyon, and Karthik Ramasamy.

We look forward to hosting the next one!

All photos courtesy Twitter, Inc.