Theia: visual signatures for problem diagnosis in large Hadoop clusters

Authors:
Elmer Garduno;Soila P. Kavulya;Jiaqi Tan;Rajeev Gandhi;Priya Narasimhan
Affiliations:
Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University
Venue:
LISA'12 Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques
Year:
2012

Abstract

Diagnosing performance problems in large distributed systems can be daunting because the sheer volume of available monitoring information can obscure the root cause of the problem. Automated diagnosis tools help narrow down the possible root causes; however, these tools are not perfect, which motivates the need for visualization tools that allow users to explore their data and gain insight into the root cause. In this paper we describe Theia, a visualization tool that analyzes application-level logs in a Hadoop cluster and generates visual signatures of each job's performance. These visual signatures provide compact representations of task durations, task status, and data consumption by jobs. We demonstrate the utility of Theia on real incidents experienced by users on a production Hadoop cluster.