Survey of software tools for evaluating reliability, availability, and serviceability
ACM Computing Surveys (CSUR)
ACM Computing Surveys (CSUR)
Fundamentals of fault-tolerant distributed computing in asynchronous environments
ACM Computing Surveys (CSUR)
Analysis and implementation of software rejuvenation in cluster systems
Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Building Survivable Services Using Redundancy and Adaptation
IEEE Transactions on Computers
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
ACM Transactions on Computer Systems (TOCS)
A longitudinal survey of Internet host reliability
SRDS '95 Proceedings of the 14th Symposium on Reliable Distributed Systems
A Practical Approach for 'Zero' Downtime in an Operational Information System
ICDCS '02 Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02)
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
A scalable distributed information management system
Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications
Failure Diagnosis Using Decision Trees
ICAC '04 Proceedings of the First International Conference on Autonomic Computing
Quickly Finding Known Software Problems via Automated Symptom Matching
ICAC '05 Proceedings of the Second International Conference on Autonomic Computing
Monitoring Large Systems Via Statistical Sampling
International Journal of High Performance Computing Applications
Capturing, indexing, clustering, and retrieving system history
Proceedings of the twentieth ACM symposium on Operating systems principles
CoMon: a mostly-scalable monitoring system for PlanetLab
ACM SIGOPS Operating Systems Review
Autonomous recovery in componentized Internet applications
Cluster Computing
Stardust: tracking activity in a distributed storage system
SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
SysProf: Online Distributed Behavior Diagnosis through Fine-grain System Monitoring
ICDCS '06 Proceedings of the 26th IEEE International Conference on Distributed Computing Systems
Automated known problem diagnosis with event traces
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Performance modeling and system management for multi-component online services
NSDI'05 Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
A Reinforcement Learning Approach to Automatic Error Recovery
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
On the Quality of Service of Crash-Recovery Failure Detectors
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
E2EProf: Automated End-to-End Performance Management for Enterprise Systems
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Performability Models for Multi-Server Systems with High-Variance Repair Durations
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Pip: detecting the unexpected in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Whodunit: transactional profiling for multi-tier applications
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Towards highly reliable enterprise network services via inference of multi-level dependencies
Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
Computer system performance problem detection using time series models
USENIX-STC'93 Proceedings of the USENIX Summer 1993 Technical Conference - Volume 1
Tracking in a spaghetti bowl: monitoring transactions using footprints
SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
DARC: dynamic analysis of root causes of latency distributions
SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A survey of autonomic computing—degrees, models, and applications
ACM Computing Surveys (CSUR)
San Fermín: aggregating large data sets using a binomial swap forest
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
D3S: debugging deployed distributed systems
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Guided Problem Diagnosis through Active Learning
ICAC '08 Proceedings of the 2008 International Conference on Autonomic Computing
Moara: flexible and scalable group-based querying system
Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
iManage: policy-driven self-management for enterprise-scale systems
Proceedings of the ACM/IFIP/USENIX 2007 International Conference on Middleware
Online Anomaly Prediction for Robust Cluster Systems
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Ranking the importance of alerts for problem determination in large computer systems
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
System monitoring with metric-correlation models: problems and solutions
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Self-correlating predictive information tracking for large-scale production systems
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
vManage: loosely coupled platform and virtualization management in data centers
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
NAP: a building block for remediating performance bottlenecks via black box network analysis
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Reference-driven performance anomaly identification
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
REMO: Resource-Aware Application State Monitoring for Large-Scale Distributed Systems
ICDCS '09 Proceedings of the 2009 29th IEEE International Conference on Distributed Computing Systems
Detailed diagnosis in enterprise networks
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Detecting large-scale system problems by mining console logs
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
EbAT: online methods for detecting utility cloud anomalies
Proceedings of the 6th Middleware Doctoral Symposium
Ganesha: blackBox diagnosis of MapReduce systems
ACM SIGMETRICS Performance Evaluation Review
Fingerprinting the datacenter: automated classification of performance crises
Proceedings of the 5th European conference on Computer systems
Toward automatic policy refinement in repair services for large distributed systems
ACM SIGOPS Operating Systems Review
PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems
Proceedings of the 7th international conference on Autonomic computing
Monalytics: online monitoring and analytics for managing large scale data centers
Proceedings of the 7th international conference on Autonomic computing
A query language and runtime tool for evaluating behavior of multi-tier servers
Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Adaptive system anomaly prediction for large-scale hosting infrastructures
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Scaling a monitoring infrastructure for the Akamai network
ACM SIGOPS Operating Systems Review
Black-box problem diagnosis in parallel file systems
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Network imprecision: a new consistency metric for scalable monitoring
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Lightweight, high-resolution monitoring for troubleshooting production systems
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Automating network application dependency discovery: experiences, limitations, and new solutions
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Mining invariants from console logs for system problem detection
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
On Predictability of System Anomalies in Real World
MASCOTS '10 Proceedings of the 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
Intrusion recovery using selective re-execution
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Scarlett: coping with skewed content popularity in mapreduce clusters
Proceedings of the sixth conference on Computer systems
Diagnosing performance changes by comparing request flows
Proceedings of the 8th USENIX conference on Networked systems design and implementation
FATE and DESTINI: a framework for cloud recovery testing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
X-trace: a pervasive network tracing framework
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Autonomic computing: the first decade
Proceedings of the 8th ACM international conference on Autonomic computing
A flexible architecture integrating monitoring and analytics for managing large-scale data centers
Proceedings of the 8th ACM international conference on Autonomic computing
Fay: extensible distributed tracing from kernels to clusters
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Detecting application-level failures in component-based Internet services
IEEE Transactions on Neural Networks
Draco: Statistical diagnosis of chronic problems in large distributed systems
DSN '12 Proceedings of the 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Net-cohort: detecting and managing VM ensembles in virtualized data centers
Proceedings of the 9th international conference on Autonomic computing
X-ray: automating root-cause diagnosis of performance anomalies in production software
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Faster, larger, easier: reining real-time big data processing in cloud
Proceedings of the Posters and Demo Track
VScope: middleware for troubleshooting time-sensitive data center applications
Proceedings of the 13th International Middleware Conference
Root cause detection in a service-oriented architecture
Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems