Stardust: tracking activity in a distributed storage system

Authors:
Eno Thereska;Brandon Salmon;John Strunk;Matthew Wachs;Michael Abd-El-Malek;Julio Lopez;Gregory R. Ganger
Affiliations:
Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University
Venue:
SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Year:
2006

Citing 21
Cited 32

Quantitative system performance: computer system analysis using queueing network models

Quantitative system performance: computer system analysis using queueing network models
Mache: no-loss trace compaction

SIGMETRICS '89 Proceedings of the 1989 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Trace-based mobile network emulation

SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
Capacity planning for Web performance: metrics, models, and methods

Capacity planning for Web performance: metrics, models, and methods
A trace-driven analysis of the UNIX 4.2 BSD file system

Proceedings of the tenth ACM symposium on Operating systems principles
Extensible, Scalable Monitoring for Clusters of Computers

LISA '97 Proceedings of the 11th Conference on Systems Administration
Automatic Generation of a Software Performance Model Using an Object-Oriented Prototype

MASCOTS '95 Proceedings of the 3rd International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems
Pinpoint: Problem Determination in Large, Dynamic Internet Services

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
ETE: A Customizable Approach to Measuring End-to-End Response Times and Their Components in Distributed Systems

ICDCS '99 Proceedings of the 19th IEEE International Conference on Distributed Computing Systems
Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Efficient Byzantine-Tolerant Erasure-Coded Storage

DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
High Resolution Forward And Inverse Earthquake Modeling on Terascale Computers

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Passive NFS Tracing of Email and Research Workloads

FAST '03 Proceedings of the 2nd USENIX Conference on File and Storage Technologies
Hibernator: helping disk arrays sleep through the winter

Proceedings of the twentieth ACM symposium on Operating systems principles
Request extraction in Magpie: events, schemas and temporal joins

Proceedings of the 11th workshop on ACM SIGOPS European workshop
A read/write protocol family for versatile storage infrastructures

A read/write protocol family for versatile storage infrastructures
Dynamic instrumentation of production systems

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Ursa minor: versatile cluster-based storage

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Correlating instrumentation data to system states: a building block for automated diagnosis and control

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Informed data distribution selection in a self-predicting storage system

ICAC '06 Proceedings of the 2006 IEEE International Conference on Autonomic Computing

Exploiting nonstationarity for performance prediction

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Categorizing and differencing system behaviours

HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
Observer: keeping system models from becoming obsolete

HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
BorderPatrol: isolating events for black-box tracing

Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Frequent pattern mining for kernel trace data

Proceedings of the 2008 ACM symposium on Applied computing
Ironmodel: robust performance models in the wild
Diagnosing distributed systems with self-propelled instrumentation

Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
DIADS: addressing the "my-problem-or-yours" syndrome with integrated SAN and database diagnosis

FAST '09 Proceedings of the 7th conference on File and storage technologies
HYDRAstor: a Scalable Secondary Storage

FAST '09 Proceedings of the 7th conference on File and storage technologies
Evaluating similarity-based trace reduction techniques for scalable performance analysis

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
DSF: a common platform for distributed systems research and development

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
Do you know your IQ?: a research agenda for information quality in systems

ACM SIGMETRICS Performance Evaluation Review
A load balancing framework for clustered storage systems

HiPC'08 Proceedings of the 15th international conference on High performance computing
Application-storage discovery

Proceedings of the 3rd Annual Haifa Experimental Systems Conference
Network imprecision: a new consistency metric for scalable monitoring

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
vPath: precise discovery of request processing paths from black-box observations of thread and network activities

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
MT-WAVE: profiling multi-tier web applications

Proceedings of the 2nd ACM/SPEC International Conference on Performance engineering
Diagnosing performance changes by comparing request flows

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Otus: resource attribution in data-intensive clusters

Proceedings of the second international workshop on MapReduce and its applications
Italian for beginners: the next steps for SLO-based management

HotStorage'11 Proceedings of the 3rd USENIX conference on Hot topics in storage and file systems
Modeling the parallel execution of black-box services

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories

ACM Transactions on Storage (TOS)
What is my program doing? program dynamics in programmer's terms

RV'11 Proceedings of the Second international conference on Runtime verification
Automated diagnosis without predictability is a recipe for failure

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing
Responding rapidly to service level violations using virtual appliances

ACM SIGOPS Operating Systems Review
Towards I/O analysis of HPC systems and a generic architecture to collect access patterns

Computer Science - Research and Development
An online service-oriented performance profiling tool for cloud computing systems

Frontiers of Computer Science: Selected Publications from Chinese Universities
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
IOFlow: a software-defined storage architecture

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Operating Systems Review
Seeing through black boxes: Tracking transactions through queues under monitoring resource constraints

Performance Evaluation

Abstract

Performance monitoring in most distributed systems provides minimal guidance for tuning, problem diagnosis, and decision making. Stardust is a monitoring infrastructure that replaces traditional performance counters with end-to-end traces of requests and allows for efficient querying of performance metrics. Such traces better inform key administrative performance challenges by enabling, for example, extraction of per-workload, per-resource demand information and per-workload latency graphs. This paper reports on our experience building and using end-to-end tracing as an on-line monitoring tool in a distributed storage system. Using diverse system workloads and scenarios, we show that such fine-grained tracing can be made efficient (less than 6% overhead) and is useful for on- and off-line analysis of system behavior. These experiences make a case for having other systems incorporate such an instrumentation framework.
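
As an illustration of the kind of query the abstract describes, the following minimal sketch aggregates end-to-end trace records into per-workload, per-resource demand and per-request latency. The record layout (`ActivityRecord` and its fields) and the function names are assumptions made for this example; they are not Stardust's actual schema or query interface, which the abstract does not specify.

```python
from collections import defaultdict
from typing import NamedTuple


class ActivityRecord(NamedTuple):
    """Hypothetical trace record; not Stardust's actual schema."""
    request_id: int   # identifier carried end to end with the request
    workload: str     # workload (e.g. a client) that issued the request
    resource: str     # resource where the activity occurred
    start_us: int     # activity start time, in microseconds
    end_us: int       # activity end time, in microseconds


def demand_by_workload_and_resource(records):
    """Sum busy time (microseconds) for each (workload, resource) pair."""
    demand = defaultdict(int)
    for r in records:
        demand[(r.workload, r.resource)] += r.end_us - r.start_us
    return dict(demand)


def latency_by_request(records):
    """Map request id to (workload, latency): last end minus first start."""
    first, last, owner = {}, {}, {}
    for r in records:
        first[r.request_id] = min(first.get(r.request_id, r.start_us), r.start_us)
        last[r.request_id] = max(last.get(r.request_id, r.end_us), r.end_us)
        owner[r.request_id] = r.workload
    return {rid: (owner[rid], last[rid] - first[rid]) for rid in first}


# Made-up sample data: three requests from two workloads.
records = [
    ActivityRecord(1, "A", "cpu", 0, 40),
    ActivityRecord(1, "A", "disk", 40, 140),
    ActivityRecord(2, "B", "cpu", 10, 30),
    ActivityRecord(2, "B", "network", 30, 90),
    ActivityRecord(3, "A", "disk", 150, 200),
]
demand = demand_by_workload_and_resource(records)
latency = latency_by_request(records)
```

Because each record carries the request and workload identifiers, both aggregates fall out of a single pass over the trace. Aggregate performance counters, by contrast, would report only the total busy time of each resource, with no way to attribute it to a workload.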