Visualizing Hadoop with HDFS-DU

Tuesday, 7 August 2012

We are heavy adopters of Apache Hadoop, and a large amount of our data resides in its clusters, so it’s important for us to understand how these resources are used. At our July Hack Week, we experimented with building HDFS-DU to give us an interactive visualization of the underlying Hadoop Distributed File System (HDFS). The project aims to monitor snapshots of the entire file system interactively, showing the size of each folder and the rate at which that size changes. It can also distinguish efficient from inefficient file storage and highlight the nodes in the file system where each occurs.
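As a rough illustration of the "rate at which the size changes" idea, here is a minimal Python sketch that compares two snapshots. The snapshot shape (a dict mapping folder path to bytes) and the name `growth_rates` are assumptions made for this example, not the HDFS-DU data model.

```python
from datetime import datetime


def growth_rates(old, new, old_time, new_time):
    """Return the bytes-per-day change for every folder in either snapshot.

    `old` and `new` are hypothetical snapshots: {folder_path: size_in_bytes}.
    A folder missing from one snapshot is treated as having size 0 there.
    """
    days = (new_time - old_time).total_seconds() / 86400
    if days <= 0:
        raise ValueError("new snapshot must be later than old snapshot")
    folders = set(old) | set(new)
    return {f: (new.get(f, 0) - old.get(f, 0)) / days for f in folders}


# Example with made-up folder names and sizes.
rates = growth_rates(
    {"/logs": 100, "/tmp": 50},
    {"/logs": 300, "/new": 10},
    datetime(2012, 8, 1),
    datetime(2012, 8, 3),
)
```

A positive rate marks a growing folder, and a negative rate marks one that shrank or disappeared between snapshots.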

HDFS-DU provides the following in a web user interface:

  • A TreeMap visualization where each node is a folder in HDFS and the area of each node reflects either its size or its number of descendants
  • A tree visualization showing the topology of the file system
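To make the two sizing options concrete, the following Python sketch rolls a flat file listing up into per-folder totals. The input format (pairs of absolute path and size in bytes) and the function name `aggregate_folders` are assumptions for illustration. As a simplification, it counts only file descendants, not subfolders.

```python
from collections import defaultdict
from posixpath import dirname


def aggregate_folders(files):
    """Roll file sizes up into every ancestor folder.

    `files` is an iterable of (path, size_in_bytes) pairs (hypothetical input).
    Returns {folder: {"bytes": total_size, "files": descendant_file_count}}.
    """
    totals = defaultdict(lambda: {"bytes": 0, "files": 0})
    for path, size in files:
        folder = dirname(path)
        while True:
            totals[folder]["bytes"] += size
            totals[folder]["files"] += 1
            parent = dirname(folder)
            if parent == folder:  # reached the root ("/" or "")
                break
            folder = parent
    return dict(totals)


# Example with made-up paths.
totals = aggregate_folders([
    ("/logs/a/x", 10),
    ("/logs/a/y", 20),
    ("/logs/b/z", 5),
    ("/tmp/t", 1),
])
```

Either field of the result could then drive the area of a TreeMap node: `bytes` for the size view, `files` for the descendant-count view.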

HDFS-DU is built using the following front-end technologies:

Details

Below is a screenshot of the HDFS-DU user interface (directory names scrubbed). The user interface is made up of two linked visualizations. The left visualization is a TreeMap and shows parent-child relationships through containment. The right visualization is a tree layout, which displays two levels of depth from the currently selected node in the file system. The tree visualization displays extra information for each node on hover.
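The "two levels of depth" behavior of the tree layout can be sketched as a depth-limited copy of the subtree under the selected node. This is an illustrative Python sketch only: the nested-dict node shape (`name`, `bytes`, `children`) and the function name `visible_subtree` are assumptions, not the project's actual structures.

```python
def visible_subtree(node, depth=2):
    """Return a copy of `node` keeping at most `depth` levels of descendants.

    `node` is a hypothetical dict: {"name": str, "bytes": int, "children": [...]}.
    """
    pruned = {"name": node["name"], "bytes": node["bytes"]}
    if depth > 0 and node.get("children"):
        pruned["children"] = [
            visible_subtree(child, depth - 1) for child in node["children"]
        ]
    return pruned


# Example with made-up folders: "x" sits three levels below the root.
tree = {
    "name": "/", "bytes": 30, "children": [
        {"name": "logs", "bytes": 30, "children": [
            {"name": "a", "bytes": 30, "children": [
                {"name": "x", "bytes": 30},
            ]},
        ]},
    ],
}
view = visible_subtree(tree)
```

Selecting a different node would simply mean calling the function again with that node as the new root.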

[Screenshot: the HDFS-DU user interface]

You can drill down on the TreeMap by clicking on a node, which has the same effect as clicking on a node in the tree visualization. There are two possible layouts for the TreeMap. The default layout encodes file size in the area of each node. The second encodes the number of descendants in the area of each node. The second view makes it easy to spot nodes where storage is inefficient: for example, a folder that looks large by descendant count but small by size likely holds many small files.
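One way to read the two layouts side by side is to compare each folder's share of its parent's area under both metrics. The Python sketch below does this under stated assumptions: the child records (`name`, `bytes`, `files`), the function names, and the `ratio` threshold are all hypothetical, and treating "many small files" as the inefficiency of interest is our reading rather than something the tool defines.

```python
def area_fractions(children, metric):
    """Share of the parent's rectangle each child receives under `metric`."""
    total = sum(child[metric] for child in children)
    if total == 0:
        return {child["name"]: 0.0 for child in children}
    return {child["name"]: child[metric] / total for child in children}


def small_file_suspects(children, ratio=2.0):
    """Names of folders whose share by file count is at least `ratio` times
    their share by size, i.e. folders that grow when switching layouts."""
    by_size = area_fractions(children, "bytes")
    by_count = area_fractions(children, "files")
    return [
        name for name in by_count
        if by_count[name] > 0 and by_count[name] >= ratio * by_size[name]
    ]


# Example with made-up folders: "tmp" holds many tiny files.
children = [
    {"name": "events", "bytes": 900, "files": 10},
    {"name": "tmp", "bytes": 100, "files": 90},
]
suspects = small_file_suspects(children)
```

In this example `tmp` takes 10% of the area in the size layout but 90% in the descendant layout, which is the kind of visual jump the second view is meant to expose.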


Future Work

This project was created at our July Hack Week, and we still consider it beta, though useful, software. In the future, we would love to improve the front-end client and create a new back end for a different runtime environment. On the front end, the directory browser, currently on the right, is poorly suited to showing the directory structure. A view that looks more like a traditional file system browser would be more immediately recognizable and make better use of space (a JavaScript file browser likely exists that could be used instead). The integration between the current file browser and the TreeMap also needs improvement.

We initially envisioned the TreeMap as a Voronoi TreeMap, but our current implementation of that code ran too slowly to be practical, and we would love to make it fast enough. We would also like to add the option to size and color the TreeMap areas by different values, such as change in size, creation time, last access time, or frequency of access.

Acknowledgements

HDFS-DU was primarily authored by Travis Crawford (@tc), Nicolas Garcia Belmonte (@philogb) and Robert Harris (@trebor). Given that this is a young project, we always appreciate bug fixes, features and documentation improvements. Feel free to fork the project and send us a pull request on GitHub to say hello. Finally, if you’re interested in visualization and distributed file systems like HDFS, we’re always looking for engineers to join the flock.

Follow @hdfsdu on Twitter to stay in touch!

- Chris Aniszczyk, Manager of Open Source (@cra)