Performance troubleshooting in data centers: an annotated bibliography?

Authors:
Chengwei Wang;Soila P. Kavulya;Jiaqi Tan;Liting Hu;Mahendra Kutare;Mike Kasick;Karsten Schwan;Priya Narasimhan;Rajeev Gandhi
Affiliations:
Georgia Institute of Technology;Carnegie Mellon University;Carnegie Mellon University;Georgia Institute of Technology;Boundary, Inc.;Carnegie Mellon University;Georgia Institute of Technology;Carnegie Mellon University;Carnegie Mellon University
Venue:
ACM SIGOPS Operating Systems Review
Year:
2013

Citing 85
Cited 0

Survey of software tools for evaluating reliability, availability, and serviceability

ACM Computing Surveys (CSUR)
Debugging concurrent programs

ACM Computing Surveys (CSUR)
Fundamentals of fault-tolerant distributed computing in asynchronous environments

ACM Computing Surveys (CSUR)
Analysis and implementation of software rejuvenation in cluster systems

Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Software-Based Replication for Fault Tolerance

Computer
Building Survivable Services Using Redundancy and Adaptation

IEEE Transactions on Computers
Pinpoint: Problem Determination in Large, Dynamic Internet Services

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining

ACM Transactions on Computer Systems (TOCS)
A longitudinal survey of Internet host reliability

SRDS '95 Proceedings of the 14th Symposium on Reliable Distributed Systems
A Practical Approach for 'Zero' Downtime in an Operational Information System

ICDCS '02 Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02)
Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
A scalable distributed information management system

Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications
Failure Diagnosis Using Decision Trees

ICAC '04 Proceedings of the First International Conference on Autonomic Computing
Quickly Finding Known Software Problems via Automated Symptom Matching

ICAC '05 Proceedings of the Second International Conference on Autonomic Computing
Monitoring Large Systems Via Statistical Sampling

International Journal of High Performance Computing Applications
Capturing, indexing, clustering, and retrieving system history

Proceedings of the twentieth ACM symposium on Operating systems principles
CoMon: a mostly-scalable monitoring system for PlanetLab

ACM SIGOPS Operating Systems Review
Autonomous recovery in componentized Internet applications

Cluster Computing
Stardust: tracking activity in a distributed storage system

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
SysProf: Online Distributed Behavior Diagnosis through Fine-grain System Monitoring

ICDCS '06 Proceedings of the 26th IEEE International Conference on Distributed Computing Systems
Discovering likely invariants of distributed transaction systems for autonomic system management

Cluster Computing
Automated known problem diagnosis with event traces

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Performance modeling and system management for multi-component online services

NSDI'05 Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
Microreboot — A technique for cheap recovery

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Correlating instrumentation data to system states: a building block for automated diagnosis and control

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Using Magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
A Reinforcement Learning Approach to Automatic Error Recovery

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
On the Quality of Service of Crash-Recovery Failure Detectors

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
E2EProf: Automated End-to-End Performance Management for Enterprise Systems

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Performability Models for Multi-Server Systems with High-Variance Repair Durations

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Pip: detecting the unexpected in distributed systems

NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Whodunit: transactional profiling for multi-tier applications

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Towards highly reliable enterprise network services via inference of multi-level dependencies

Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
Computer system performance problem detection using time series models

Usenix-stc'93 Proceedings of the USENIX Summer 1993 Technical Conference - Volume 1
Tracking in a spaghetti bowl: monitoring transactions using footprints

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
DARC: dynamic analysis of root causes of latency distributions

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A survey of autonomic computing—degrees, models, and applications

ACM Computing Surveys (CSUR)
San Fermín: aggregating large data sets using a binomial swap forest

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
D3S: debugging deployed distributed systems

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Guided Problem Diagnosis through Active Learning

ICAC '08 Proceedings of the 2008 International Conference on Autonomic Computing
Moara: flexible and scalable group-based querying system

Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
iManage: policy-driven self-management for enterprise-scale systems

Proceedings of the ACM/IFIP/USENIX 2007 International Conference on Middleware
Online Anomaly Prediction for Robust Cluster Systems

ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Ranking the importance of alerts for problem determination in large computer systems

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
System monitoring with metric-correlation models: problems and solutions

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Self-correlating predictive information tracking for large-scale production systems

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
vManage: loosely coupled platform and virtualization management in data centers

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
NAP: a building block for remediating performance bottlenecks via black box network analysis

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Reference-driven performance anomaly identification

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
REMO: Resource-Aware Application State Monitoring for Large-Scale Distributed Systems

ICDCS '09 Proceedings of the 2009 29th IEEE International Conference on Distributed Computing Systems
Detailed diagnosis in enterprise networks

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Detecting large-scale system problems by mining console logs

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
EbAT: online methods for detecting utility cloud anomalies

Proceedings of the 6th Middleware Doctoral Symposium
Ganesha: black-box diagnosis of MapReduce systems

ACM SIGMETRICS Performance Evaluation Review
Fingerprinting the datacenter: automated classification of performance crises

Proceedings of the 5th European conference on Computer systems
Toward automatic policy refinement in repair services for large distributed systems

ACM SIGOPS Operating Systems Review
PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems

Proceedings of the 7th international conference on Autonomic computing
Monalytics: online monitoring and analytics for managing large scale data centers

Proceedings of the 7th international conference on Autonomic computing
A query language and runtime tool for evaluating behavior of multi-tier servers

Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Adaptive system anomaly prediction for large-scale hosting infrastructures

Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Scaling a monitoring infrastructure for the Akamai network

ACM SIGOPS Operating Systems Review
Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers

IEEE Micro
Black-box problem diagnosis in parallel file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Network imprecision: a new consistency metric for scalable monitoring

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Lightweight, high-resolution monitoring for troubleshooting production systems

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Automating network application dependency discovery: experiences, limitations, and new solutions

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
vPath: precise discovery of request processing paths from black-box observations of thread and network activities

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Mining invariants from console logs for system problem detection

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
On Predictability of System Anomalies in Real World

MASCOTS '10 Proceedings of the 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
Intrusion recovery using selective re-execution

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Scarlett: coping with skewed content popularity in MapReduce clusters

Proceedings of the sixth conference on Computer systems
Diagnosing performance changes by comparing request flows

Proceedings of the 8th USENIX conference on Networked systems design and implementation
FATE and DESTINI: a framework for cloud recovery testing

Proceedings of the 8th USENIX conference on Networked systems design and implementation
X-Trace: a pervasive network tracing framework

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Autonomic computing: the first decade

Proceedings of the 8th ACM international conference on Autonomic computing
A flexible architecture integrating monitoring and analytics for managing large-scale data centers

Proceedings of the 8th ACM international conference on Autonomic computing
Fay: extensible distributed tracing from kernels to clusters

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Detecting application-level failures in component-based Internet services

IEEE Transactions on Neural Networks
Draco: Statistical diagnosis of chronic problems in large distributed systems

DSN '12 Proceedings of the 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Net-cohort: detecting and managing VM ensembles in virtualized data centers

Proceedings of the 9th international conference on Autonomic computing
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Faster, larger, easier: reining real-time big data processing in cloud

Proceedings of the Posters and Demo Track
VScope: middleware for troubleshooting time-sensitive data center applications

Proceedings of the 13th International Middleware Conference
Root cause detection in a service-oriented architecture

Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems
