Lessons from Giant-Scale Services
IEEE Internet Computing
Architecture and Dependability of Large-Scale Internet Services
IEEE Internet Computing
Software Dependability in the Tandem GUARDIAN System
IEEE Transactions on Software Engineering
Experimental Study of Internet Stability and Backbone Failures
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Networked Windows NT System Field Failure Data Analysis
PRDC '99 Proceedings of the 1999 Pacific Rim International Symposium on Dependable Computing
Failure Data Analysis of a LAN of Windows NT Based Computers
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Analyze-NOW-an environment for collection and analysis of failures in a network of workstations
ISSRE '96 Proceedings of the Seventh International Symposium on Software Reliability Engineering
Software Rejuvenation: Analysis, Module and Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Checking system rules using system-specific, programmer-written compiler extensions
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Using fault injection and modeling to evaluate the performability of cluster-based services
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
The cutting EDGE of IP router configuration
ACM SIGCOMM Computer Communication Review
Improving availability with recursive microreboots: a soft-state system case study
Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
Combining routing and traffic data for detection of IP forwarding anomalies
Proceedings of the joint international conference on Measurement and modeling of computer systems
A wavelet-based framework for proactive detection of network misconfigurations
Proceedings of the ACM SIGCOMM workshop on Network troubleshooting: research, theory and operations practice meet malfunctioning reality
IP forwarding anomalies and improving their detection using multiple data sources
Proceedings of the ACM SIGCOMM workshop on Network troubleshooting: research, theory and operations practice meet malfunctioning reality
Supporting Cluster-Based Network Services on Functionally Symmetric Software Architecture
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A Self-Organizing Storage Cluster for Parallel Data-Intensive Applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
A clean slate 4D approach to network control and management
ACM SIGCOMM Computer Communication Review
AOSD for internet service clusters: the case of availability
AOMD '05 Proceedings of the 1st workshop on Aspect oriented middleware development
HANet: a framework toward ultimately reliable network services
Journal of Systems and Software
Autonomous recovery in componentized Internet applications
Cluster Computing
Analyzing persistent state interactions to improve state management
SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Towards human-centred design: Two case studies
Journal of Systems and Software
Business processes for web services: principles and applications
IBM Systems Journal
CONMan: taking the complexity out of network management
Proceedings of the 2006 SIGCOMM workshop on Internet network management
Modeling and Tracking of Transaction Flow Dynamics for Fault Detection in Complex Systems
IEEE Transactions on Dependable and Secure Computing
A: an assertion language for distributed systems
Proceedings of the 3rd workshop on Programming languages and operating systems: linguistic support for modern operating systems
Failover, load sharing and server architecture in SIP telephony
Computer Communications
Undo for operators: building an undoable e-mail store
ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
A digital preservation network appliance based on OpenBSD
BSDC'03 Proceedings of the BSD Conference 2003 on BSD Conference
Human-aware computer system design
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Network configuration management via model finding
LISA '05 Proceedings of the 19th conference on Large Installation System Administration Conference - Volume 19
Path-based failure and evolution management
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Understanding and dealing with operator mistakes in internet services
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
SkipNet: a scalable overlay network with practical locality properties
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Automatic configuration of internet services
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Discrete control for safe execution of IT automation workflows
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
CONMan: a step towards network manageability
Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?
ACM Transactions on Storage (TOS)
Zyzzyva: speculative byzantine fault tolerance
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
PeerReview: practical accountability for distributed systems
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Flight data recorder: monitoring persistent-state interactions to improve systems management
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Distributed directory service in the Farsite file system
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Efficient and Scalable Algorithms for Inferring Likely Invariants in Distributed Systems
IEEE Transactions on Knowledge and Data Engineering
Failure Detection in Large-Scale Internet Services by Principal Subspace Mapping
IEEE Transactions on Knowledge and Data Engineering
Reengineering J2EE Servers for Automated Management in Distributed Environments
IEEE Distributed Systems Online
Snitch: interactive decision trees for troubleshooting misconfigurations
SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
SafeStore: a durable and practical storage system
ATC'07 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
Using components for architecture-based management: the self-repair case
Proceedings of the 30th international conference on Software engineering
Integrated system models for reliable petascale storage systems
PDSW '07 Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing '07
DieCast: testing distributed systems with an accurate scale model
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
A policy-aware switching layer for data centers
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
Shadow configuration as a network management primitive
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
A Runtime Constraint-Aware Solution for Automated Refinement of IT Change Plans
DSOM '08 Proceedings of the 19th IFIP/IEEE international workshop on Distributed Systems: Operations and Management: Managing Large-Scale Service Deployment
The Role of Field Data for Analyzing the Dependability of Short Range Wireless Technologies
SEUS '08 Proceedings of the 6th IFIP WG 10.2 international workshop on Software Technologies for Embedded and Ubiquitous Systems
Towards automatic reverse engineering of software security configurations
Proceedings of the 15th ACM conference on Computer and communications security
Declarative Infrastructure Configuration Synthesis and Debugging
Journal of Network and Systems Management
Network-Wide Rollback Scheme for Fast Recovery from Operator Errors Toward Dependable Network
APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
Active Diagnosis of High-Level Faults in Distributed Internet Services
APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
High-available grid services through the use of virtualized clustering
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
Understanding customer problem troubleshooting from storage system logs
FAST '09 Proceedings of the 7th conference on File and storage technologies
LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware
Ranking the importance of alerts for problem determination in large computer systems
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Unraveling the complexity of network management
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Combining Virtual Organization and Local Policies for Automated Configuration of Grid Services
Proceedings of the 2005 conference on Self-Organization and Autonomic Informatics (I)
Achieving Self-Healing in Autonomic Software Systems: a Case-Based Reasoning Approach
Proceedings of the 2005 conference on Self-Organization and Autonomic Informatics (I)
Software Engineering for Self-Adaptive Systems
Detailed diagnosis in enterprise networks
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Remote network labs: an on-demand network cloud for configuration testing
Proceedings of the 1st ACM workshop on Research on enterprise networking
How to find self-inflicted troubles
Journal of Computing Sciences in Colleges
NetPiler: detection of ineffective router configurations
IEEE Journal on Selected Areas in Communications - Special issue on network infrastructure configuration
Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
EbAT: online methods for detecting utility cloud anomalies
Proceedings of the 6th Middleware Doctoral Symposium
Remote network labs: an on-demand network cloud for configuration testing
ACM SIGCOMM Computer Communication Review
CHANGEMINER: a solution for discovering IT change templates from past execution traces
IM'09 Proceedings of the 11th IFIP/IEEE international conference on Symposium on Integrated Network Management
Improving IT Change Management Processes with Automated Risk Assessment
DSOM '09 Proceedings of the 20th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management: Integrated Management of Systems, Services, Processes and People in IT
IT and infrastructure's lost dependability
SE '08 Proceedings of the IASTED International Conference on Software Engineering
Telehealth/AT '08 Proceedings of the IASTED International Conference on Telehealth/Assistive Technologies
CatchAndRetry: extending exceptions to handle distributed system failures and recovery
Proceedings of the Fifth Workshop on Programming Languages and Operating Systems
Barricade: defending systems against operator mistakes
Proceedings of the 5th European conference on Computer systems
The challenges of application service hosting
ICWE'07 Proceedings of the 7th international conference on Web engineering
Are clouds ready for large distributed applications?
ACM SIGOPS Operating Systems Review
Service combinators for farming virtual machines
COORDINATION'08 Proceedings of the 10th international conference on Coordination models and languages
Characterizing cloud computing hardware reliability
Proceedings of the 1st ACM symposium on Cloud computing
Dynamic updates for web and cloud applications
APLWACA '10 Proceedings of the 2010 Workshop on Analysis and Programming Languages for Web Applications and Cloud Applications
Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
MMS: an autonomic network-layer foundation for network management
IEEE Journal on Selected Areas in Communications
An Analysis of Traces from a Production MapReduce Cluster
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Toward quantifying system manageability
HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Empirical comparison of techniques for automated failure diagnosis
SysML'08 Proceedings of the Third conference on Tackling computer systems problems with machine learning techniques
To upgrade or not to upgrade: impact of online upgrades across multiple administrative domains
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
BasisDetect: a model-based network event detection framework
IMC '10 Proceedings of the 10th ACM SIGCOMM conference on Internet measurement
Automating configuration troubleshooting with dynamic information flow analysis
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Depot: cloud storage with minimal trust
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
A survey of system configuration tools
LISA'10 Proceedings of the 24th international conference on Large installation system administration
The margrave tool for firewall analysis
LISA'10 Proceedings of the 24th international conference on Large installation system administration
Analyzing web logs to detect user-visible failures
SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
DieCast: Testing Distributed Systems with an Accurate Scale Model
ACM Transactions on Computer Systems (TOCS)
Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs
Proceedings of the sixth conference on Computer systems
COSCA: an easy-to-use component-based PaaS cloud system for common applications
Proceedings of the First International Workshop on Cloud Computing Platforms
Using hierarchal change mining to manage network security policy evolution
Hot-ICE'11 Proceedings of the 11th USENIX conference on Hot topics in management of internet, cloud, and enterprise networks and services
Tesseract: a 4D network control plane
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
From Autonomic to Self-Self Behaviors: The JADE Experience
ACM Transactions on Autonomous and Adaptive Systems (TAAS)
An empirical study on configuration errors in commercial and open source systems
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Depot: Cloud Storage with Minimal Trust
ACM Transactions on Computer Systems (TOCS)
Job failures in high performance computing systems: A large-scale empirical study
Computers & Mathematics with Applications
Firewall policy change-impact analysis
ACM Transactions on Internet Technology (TOIT)
Integrated management of network and security devices in IT infrastructures
Proceedings of the 7th International Conference on Network and Services Management
Simplifying autonomic enterprise java bean applications via model-driven development: a case study
MoDELS'05 Proceedings of the 8th international conference on Model Driven Engineering Languages and Systems
Case-based reasoning for autonomous service failure diagnosis and remediation in software systems
ECCBR'06 Proceedings of the 8th European conference on Advances in Case-Based Reasoning
End-user perspectives of Internet connectivity problems
Computer Networks: The International Journal of Computer and Telecommunications Networking
Impact analysis of BGP sessions for prioritization of maintenance operations
Computer Networks: The International Journal of Computer and Telecommunications Networking
Diagnosis of software failures using computational geometry
ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering
SP 800-144. Guidelines on Security and Privacy in Public Cloud Computing
SP 800-144. Guidelines on Security and Privacy in Public Cloud Computing
RCourse: a robustness benchmarking suite for publish/subscribe overlay simulations with Peersim
Proceedings of the First Workshop on P2P and Dependability
Transparent VPN failure recovery with virtualization
Future Generation Computer Systems
Separating Performance Anomalies from Workload-Explained Failures in Streaming Servers
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
A tutorial on reliability in publish/subscribe services
Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems
Failure-aware resource provisioning for hybrid Cloud infrastructure
Journal of Parallel and Distributed Computing
Improving manageability through reorganization of routing-policy configurations
Computer Networks: The International Journal of Computer and Telecommunications Networking
Cache-Based Query Processing for Search Engines
ACM Transactions on the Web (TWEB)
Automatic undo for cloud management via AI planning
HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
X-ray: automating root-cause diagnosis of performance anomalies in production software
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Change-impact analysis of firewall policies
ESORICS'07 Proceedings of the 12th European conference on Research in Computer Security
ACM SIGCOMM Computer Communication Review
Internet on the move: challenges and solutions
ACM SIGCOMM Computer Communication Review
A declarative approach to automated configuration
LISA'12 Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques
XUTools: UNIX commands for processing next-generation structured text
LISA'12 Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques
Survey On reliability in publish/subscribe services
Computer Networks: The International Journal of Computer and Telecommunications Networking
Failure recovery: when the cure is worse than the disease
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Quantifying and verifying reachability for access controlled networks
IEEE/ACM Transactions on Networking (TON)
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ACM SIGOPS 24th Symposium on Operating Systems Principles
Do not blame users for misconfigurations
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
An automated system for emulated network experimentation
Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
A survey of software aging and rejuvenation studies
ACM Journal on Emerging Technologies in Computing Systems (JETC) - Special Issue on Reliability and Device Degradation in Emerging Technologies and Special Issue on WoSAR 2011
Detecting cloud provisioning errors using an annotated process model
Proceedings of the 8th Workshop on Middleware for Next Generation Internet Computing
EnCore: exploiting system environment and correlation information for misconfiguration detection
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Supporting undoability in systems operations
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Workload-aware anomaly detection for Web applications
Journal of Systems and Software
In 1986, Jim Gray published his landmark study of the causes of failures in Tandem systems and the techniques Tandem used to prevent them [6]. Seventeen years later, Internet services have replaced fault-tolerant servers as the new kid on the 24×7-availability block. Using data from three large-scale Internet services, we analyzed the causes of their failures and the (potential) effectiveness of various techniques for preventing and mitigating service failure. We find that (1) operator error is the largest cause of failures in two of the three services, (2) operator error is the largest contributor to time to repair in two of the three services, (3) configuration errors are the largest category of operator errors, (4) failures in custom-written front-end software are significant, and (5) more extensive online testing and more thorough exposure and detection of component failures would reduce failure rates in at least one service. Qualitatively, we find that improvements in the maintenance tools and systems used by service operations staff would decrease the time to diagnose and repair problems.