Visualize Data Workflows with Ambrose


Last Friday at our Apache Pig Hackathon, we open-sourced Twitter Ambrose, a tool which helps authors of large-scale data workflows keep track of the overall status of a workflow and visualize its progress.

Ambrose was hatched by Bill Graham (@billonahill) and Andy Schlaikjer (@sagemintblue) at our last Hack Week, which focused on internal tools and developer efficiency. At Twitter, we develop complex workflows to analyze massive data sets generated by our platform. Our engineers create these workflows using a variety of tools and languages, including Pig and Scalding. One difficulty many of us face when using these tools is observability: when a Pig script is executed, multiple MapReduce jobs might be launched, either in parallel or serially if one job depends on the output of another. As these jobs run, the status of each individual job can be monitored with the Hadoop Job Tracker UI, but the overall progress of the script is difficult to keep track of. With Ambrose, the real-time status of a complex series of MapReduce jobs can be visualized succinctly, so that we can quickly understand how far computation has progressed and diagnose failures in context.
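To make the observability problem concrete, here is a minimal Python sketch of a workflow as a set of jobs with dependencies. It is an illustration only and not Ambrose's actual data model or API: the `Job` and `Status` types, the job names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"


@dataclass
class Job:
    # Hypothetical stand-in for one MapReduce job launched by a script.
    job_id: str
    depends_on: list = field(default_factory=list)  # ids of upstream jobs
    status: Status = Status.PENDING


def overall_progress(jobs):
    """Fraction of jobs in the workflow that have completed."""
    if not jobs:
        return 0.0
    done = sum(1 for job in jobs.values() if job.status is Status.COMPLETE)
    return done / len(jobs)


def runnable(jobs):
    """Ids of pending jobs whose upstream jobs have all completed."""
    return sorted(
        job.job_id
        for job in jobs.values()
        if job.status is Status.PENDING
        and all(jobs[dep].status is Status.COMPLETE for dep in job.depends_on)
    )


# Two loads can run in parallel; the join must wait for both,
# and the store must wait for the join.
workflow = {
    job.job_id: job
    for job in [
        Job("load_a", status=Status.COMPLETE),
        Job("load_b", status=Status.RUNNING),
        Job("join", depends_on=["load_a", "load_b"]),
        Job("store", depends_on=["join"]),
    ]
}

print(overall_progress(workflow))
print(runnable(workflow))
```

Looking at any single job tells you little about the script as a whole. Questions such as "how far along are we?" or "what is waiting on what?" only have answers once the jobs are viewed together with their dependencies, which is the view Ambrose aims to provide.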

In this screenshot, we see the Ambrose UI for a workflow compiled from a single Pig script. The circular chord diagram in the upper left highlights dependencies between jobs. As a job’s status changes, the color of its arc in the diagram changes. Statistics for the most recently started job are displayed to the right of the chord diagram. Summary information and the status of all jobs are displayed in the table beneath these two views.

At the moment, Ambrose only works with Pig; however, the framework is extensible and can support other runtimes. We plan to support Cascading and Scalding, and we welcome patches for other runtimes as well. Ambrose also relies on a number of other great open-source projects, including Jetty, D3.js, and Twitter Bootstrap.

Ambrose is still early in development and has a growing list of features we’d love to add, but we’ve open-sourced it now so that we can develop it in the open and get community feedback. We encourage you to download it and try it out. If you’re interested in working on and evolving data visualization tools like Ambrose, join the flock. We’d love to hear what you think: Tweet us at @Ambrose or file an issue.

- Chris Aniszczyk, Manager of Open Source (@cra)