I/O system performance debugging using model-driven anomaly characterization

Authors:
Kai Shen; Ming Zhong; Chuanpeng Li
Affiliations:
Department of Computer Science, University of Rochester (all three authors)
Venue:
FAST '05: Proceedings of the 4th USENIX Conference on File and Storage Technologies, Volume 4
Year:
2005



Abstract

Identifying performance problems and pinpointing their root causes is challenging in complex systems, especially when the system supports a wide range of workloads and when performance problems materialize only under particular workload conditions. This paper proposes a model-driven anomaly characterization approach and uses it to discover operating system performance bugs in the support of disk I/O-intensive online servers. We construct a whole-system I/O throughput model as the reference for expected performance, and we use statistical clustering and characterization of performance anomalies to guide debugging. Unlike previous performance debugging methods, which offer detailed statistics at specific execution settings, our approach focuses on comprehensive anomaly characterization over wide ranges of workload conditions and system configurations. Our approach helps us quickly identify four performance bugs in the I/O system of the recent Linux 2.6.10 kernel (one in file system prefetching, two in the anticipatory I/O scheduler, and one in the elevator I/O scheduler). Our experiments with two Web server benchmarks, a trace-driven index searching server, and the TPC-C database benchmark show that the corrected kernel improves system throughput by up to fivefold over the original kernel (averaging 6%, 32%, 39%, and 16% for the four server workloads).
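
The general idea of comparing measured throughput against a model's expectation, and then asking which settings the anomalies concentrate in, can be illustrated with a minimal sketch. This is not the paper's method: the abstract describes statistical clustering, whereas the sketch below substitutes a much simpler per-setting anomaly rate. The function name `anomaly_rates`, the 0.5 threshold, the parameter names, and all throughput numbers are hypothetical and made up purely for illustration.

```python
from collections import Counter


def anomaly_rates(samples, threshold=0.5):
    """Return {(parameter, value): fraction of samples that are anomalous}.

    samples: iterable of (config, measured, expected), where config is a
    dict of workload/system settings and the other two are throughputs.
    A sample counts as anomalous when measured < threshold * expected.
    The threshold is an assumed cutoff, not a value from the paper.
    """
    seen = Counter()
    anomalous = Counter()
    for config, measured, expected in samples:
        is_anomaly = measured < threshold * expected
        for item in config.items():
            seen[item] += 1
            if is_anomaly:
                anomalous[item] += 1
    return {item: anomalous[item] / seen[item] for item in seen}


# Synthetic, made-up samples: (settings, measured MB/s, model-expected MB/s).
samples = [
    ({"scheduler": "anticipatory", "concurrency": 1}, 48.0, 50.0),
    ({"scheduler": "anticipatory", "concurrency": 64}, 9.0, 40.0),
    ({"scheduler": "elevator", "concurrency": 1}, 47.0, 50.0),
    ({"scheduler": "elevator", "concurrency": 64}, 36.0, 40.0),
]

rates = anomaly_rates(samples)
# Settings with the highest anomaly rate are the first places to look.
suspects = sorted(rates, key=rates.get, reverse=True)
```

In this toy data only the second sample falls below half of its expected throughput, so the settings it uses rank first among the suspects. A real study would sample many more configurations and use a proper clustering or characterization technique.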