Hunting for problems with Artemis

Authors:
Gabriela F. Creţu-Ciocârlie;Mihai Budiu;Moises Goldszmidt
Affiliations:
Microsoft Research, Silicon Valley;Microsoft Research, Silicon Valley;Microsoft Research, Silicon Valley
Venue:
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Year:
2008

Abstract

Artemis is a modular application for analyzing and troubleshooting the performance of large clusters running datacenter services. It comprises four modules: (1) distributed log collection and data extraction; (2) a database that stores the extracted data; (3) an interactive visualization tool for exploring the data; and (4) a plug-in interface, with a set of sample plug-ins, that lets users implement data analysis tools. Plug-ins can (a) extract and construct new features from the basic measurements collected and (b) implement and invoke statistical and machine learning algorithms and tools. In this paper we describe each of these components and then illustrate the power of the plug-in architecture with a case study that uses Artemis to analyze a Dryad application running on a 240-machine cluster.
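
To make the plug-in idea in the abstract concrete, the following is a minimal sketch of what a feature-construction plug-in and a statistical plug-in might look like. It is not Artemis's actual interface: the `Measurement` record, the `PLUGINS` registry, the `plugin` decorator, and both plug-in names are hypothetical and chosen only for illustration. The z-score outlier test stands in for the "statistical and machine learning algorithms" the abstract mentions and is an assumption, not the method used in the paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Dict, List


@dataclass
class Measurement:
    """One extracted log record (hypothetical schema)."""
    machine: str
    metric: str
    value: float


# Hypothetical registry mapping a plug-in name to its analysis function.
PLUGINS: Dict[str, Callable[[List[Measurement]], Dict[str, float]]] = {}


def plugin(name: str):
    """Decorator that registers an analysis function under `name`."""
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register


@plugin("mean_per_machine")
def mean_per_machine(rows: List[Measurement]) -> Dict[str, float]:
    """Feature construction: average the raw values for each machine."""
    by_machine: Dict[str, List[float]] = {}
    for r in rows:
        by_machine.setdefault(r.machine, []).append(r.value)
    return {m: mean(vals) for m, vals in by_machine.items()}


@plugin("outlier_scores")
def outlier_scores(rows: List[Measurement]) -> Dict[str, float]:
    """Statistical analysis: z-score of each machine's mean across the cluster."""
    feats = mean_per_machine(rows)
    values = list(feats.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All machines look identical; no machine stands out.
        return {m: 0.0 for m in feats}
    return {m: (x - mu) / sigma for m, x in feats.items()}


def run_plugin(name: str, rows: List[Measurement]) -> Dict[str, float]:
    """Look up a registered plug-in by name and apply it to the data."""
    return PLUGINS[name](rows)


if __name__ == "__main__":
    # Made-up sample data: m3 reports a much higher value than its peers.
    sample = [
        Measurement("m1", "cpu_seconds", 10.0),
        Measurement("m1", "cpu_seconds", 12.0),
        Measurement("m2", "cpu_seconds", 11.0),
        Measurement("m3", "cpu_seconds", 50.0),
    ]
    scores = run_plugin("outlier_scores", sample)
    print(max(scores, key=scores.get))
```

The registry keeps feature construction and analysis as separate, independently invocable units, which mirrors the (a)/(b) split described in the abstract. A real deployment would read the measurements from the database module rather than from an in-memory list.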