Why do internet services fail, and what can be done about it?

Authors:
David Oppenheimer;Archana Ganapathi;David A. Patterson
Affiliations:
University of California at Berkeley, EECS Computer Science Division, Berkeley, CA;University of California at Berkeley, EECS Computer Science Division, Berkeley, CA;University of California at Berkeley, EECS Computer Science Division, Berkeley, CA
Venue:
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Year:
2003

Citing 11
Cited 133

Lessons from Giant-Scale Services

IEEE Internet Computing
Architecture and Dependability of Large-Scale Internet Services

IEEE Internet Computing
Sources of Failure in the Public Switched Telephone Network

Computer
Software Dependability in the Tandem GUARDIAN System

IEEE Transactions on Software Engineering
Experimental Study of Internet Stability and Backbone Failures

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Networked Windows NT System Field Failure Data Analysis

PRDC '99 Proceedings of the 1999 Pacific Rim International Symposium on Dependable Computing
Failure Data Analysis of a LAN of Windows NT Based Computers

SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Analyze-NOW-an environment for collection and analysis of failures in a network of workstations

ISSRE '96 Proceedings of the Seventh International Symposium on Software Reliability Engineering
Software Rejuvenation: Analysis, Module and Applications

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Checking system rules using system-specific, programmer-written compiler extensions

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Using fault injection and modeling to evaluate the performability of cluster-based services

USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4

The cutting EDGE of IP router configuration

ACM SIGCOMM Computer Communication Review
Improving availability with recursive microreboots: a soft-state system case study

Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
Combining routing and traffic data for detection of IP forwarding anomalies

Proceedings of the joint international conference on Measurement and modeling of computer systems
A wavelet-based framework for proactive detection of network misconfigurations

Proceedings of the ACM SIGCOMM workshop on Network troubleshooting: research, theory and operations practice meet malfunctioning reality
IP forwarding anomalies and improving their detection using multiple data sources

Proceedings of the ACM SIGCOMM workshop on Network troubleshooting: research, theory and operations practice meet malfunctioning reality
Supporting Cluster-Based Network Services on Functionally Symmetric Software Architecture

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A Self-Organizing Storage Cluster for Parallel Data-Intensive Applications

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
A clean slate 4D approach to network control and management

ACM SIGCOMM Computer Communication Review
AOSD for internet service clusters: the case of availability

AOMD '05 Proceedings of the 1st workshop on Aspect oriented middleware development
HANet: a framework toward ultimately reliable network services

Journal of Systems and Software
Autonomous recovery in componentized Internet applications

Cluster Computing
Analyzing persistent state interactions to improve state management

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Towards human-centred design: Two case studies

Journal of Systems and Software
Business processes for web services: principles and applications

IBM Systems Journal
CONMan: taking the complexity out of network management

Proceedings of the 2006 SIGCOMM workshop on Internet network management
Modeling and Tracking of Transaction Flow Dynamics for Fault Detection in Complex Systems

IEEE Transactions on Dependable and Secure Computing
A: an assertion language for distributed systems

Proceedings of the 3rd workshop on Programming languages and operating systems: linguistic support for modern operating systems
Failover, load sharing and server architecture in SIP telephony

Computer Communications
Undo for operators: building an undoable e-mail store

ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
A digital preservation network appliance based on OpenBSD

BSDC'03 Proceedings of the BSD Conference 2003 on BSD Conference
Human-aware computer system design

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Network configuration management via model finding

LISA '05 Proceedings of the 19th conference on Large Installation System Administration Conference - Volume 19
Path-based failure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Understanding and dealing with operator mistakes in internet services

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
SkipNet: a scalable overlay network with practical locality properties

USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Automatic configuration of internet services

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Discrete control for safe execution of IT automation workflows

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
CONMan: a step towards network manageability

Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

ACM Transactions on Storage (TOS)
Zyzzyva: speculative byzantine fault tolerance

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
PeerReview: practical accountability for distributed systems

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Flight data recorder: monitoring persistent-state interactions to improve systems management

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Distributed directory service in the Farsite file system

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Efficient and Scalable Algorithms for Inferring Likely Invariants in Distributed Systems

IEEE Transactions on Knowledge and Data Engineering
Failure Detection in Large-Scale Internet Services by Principal Subspace Mapping

IEEE Transactions on Knowledge and Data Engineering
Reengineering J2EE Servers for Automated Management in Distributed Environments

IEEE Distributed Systems Online
Snitch: interactive decision trees for troubleshooting misconfigurations

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
SafeStore: a durable and practical storage system

ATC'07 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
Using components for architecture-based management: the self-repair case

Proceedings of the 30th international conference on Software engineering
Integrated system models for reliable petascale storage systems

PDSW '07 Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing '07
DieCast: testing distributed systems with an accurate scale model

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
A policy-aware switching layer for data centers

Proceedings of the ACM SIGCOMM 2008 conference on Data communication
Shadow configuration as a network management primitive

Proceedings of the ACM SIGCOMM 2008 conference on Data communication
A Runtime Constraint-Aware Solution for Automated Refinement of IT Change Plans

DSOM '08 Proceedings of the 19th IFIP/IEEE international workshop on Distributed Systems: Operations and Management: Managing Large-Scale Service Deployment
The Role of Field Data for Analyzing the Dependability of Short Range Wireless Technologies

SEUS '08 Proceedings of the 6th IFIP WG 10.2 international workshop on Software Technologies for Embedded and Ubiquitous Systems
Towards automatic reverse engineering of software security configurations

Proceedings of the 15th ACM conference on Computer and communications security
Declarative Infrastructure Configuration Synthesis and Debugging

Journal of Network and Systems Management
Network-Wide Rollback Scheme for Fast Recovery from Operator Errors Toward Dependable Network

APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
Active Diagnosis of High-Level Faults in Distributed Internet Services

APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
High-available grid services through the use of virtualized clustering

GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
Understanding customer problem troubleshooting from storage system logs

FAST '09 Proceedings of the 7th conference on File and storage technologies
BFT: the time is now

LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware
Ranking the importance of alerts for problem determination in large computer systems

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Determining configuration parameter dependencies via analysis of configuration data from multi-tiered enterprise applications

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Unraveling the complexity of network management

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Combining Virtual Organization and Local Policies for Automated Configuration of Grid Services

Proceedings of the 2005 conference on Self-Organization and Autonomic Informatics (I)
Achieving Self-Healing in Autonomic Software Systems: a Case-Based Reasoning Approach

Proceedings of the 2005 conference on Self-Organization and Autonomic Informatics (I)
Using Filtered Cartesian Flattening and Microrebooting to Build Enterprise Applications with Self-adaptive Healing

Software Engineering for Self-Adaptive Systems
Detailed diagnosis in enterprise networks

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Remote network labs: an on-demand network cloud for configuration testing

Proceedings of the 1st ACM workshop on Research on enterprise networking
How to find self-inflicted troubles

Journal of Computing Sciences in Colleges
NetPiler: detection of ineffective router configurations

IEEE Journal on Selected Areas in Communications - Special issue on network infrastructure configuration
Why do upgrades fail and what can we do about it?: toward dependable, online upgrades in enterprise system

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
EbAT: online methods for detecting utility cloud anomalies

Proceedings of the 6th Middleware Doctoral Symposium
Remote network labs: an on-demand network cloud for configuration testing

ACM SIGCOMM Computer Communication Review
CHANGEMINER: a solution for discovering IT change templates from past execution traces

IM'09 Proceedings of the 11th IFIP/IEEE international conference on Symposium on Integrated Network Management
Improving IT Change Management Processes with Automated Risk Assessment

DSOM '09 Proceedings of the 20th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management: Integrated Management of Systems, Services, Processes and People in IT
IT and infrastructure's lost dependability

SE '08 Proceedings of the IASTED International Conference on Software Engineering
Communicating security policies to trusted e-health information systems: a specification process based approach

Telehealth/AT '08 Proceedings of the IASTED International Conference on Telehealth/Assistive Technologies
CatchAndRetry: extending exceptions to handle distributed system failures and recovery

Proceedings of the Fifth Workshop on Programming Languages and Operating Systems
Barricade: defending systems against operator mistakes

Proceedings of the 5th European conference on Computer systems
The challenges of application service hosting

ICWE'07 Proceedings of the 7th international conference on Web engineering
Are clouds ready for large distributed applications?

ACM SIGOPS Operating Systems Review
Service combinators for farming virtual machines

COORDINATION'08 Proceedings of the 10th international conference on Coordination models and languages
Characterizing cloud computing hardware reliability

Proceedings of the 1st ACM symposium on Cloud computing
Dynamic updates for web and cloud applications

APLWACA '10 Proceedings of the 2010 Workshop on Analysis and Programming Languages for Web Applications and Cloud Applications
Why do upgrades fail and what can we do about it?: toward dependable, online upgrades in enterprise system

Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
MMS: an autonomic network-layer foundation for network management

IEEE Journal on Selected Areas in Communications
An Analysis of Traces from a Production MapReduce Cluster

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Toward quantifying system manageability

HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Empirical comparison of techniques for automated failure diagnosis

SysML'08 Proceedings of the Third conference on Tackling computer systems problems with machine learning techniques
To upgrade or not to upgrade: impact of online upgrades across multiple administrative domains

Proceedings of the ACM international conference on Object oriented programming systems languages and applications
BasisDetect: a model-based network event detection framework

IMC '10 Proceedings of the 10th ACM SIGCOMM conference on Internet measurement
Automating configuration troubleshooting with dynamic information flow analysis

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Depot: cloud storage with minimal trust

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
A survey of system configuration tools

LISA'10 Proceedings of the 24th international conference on Large installation system administration
The margrave tool for firewall analysis

LISA'10 Proceedings of the 24th international conference on Large installation system administration
Analyzing web logs to detect user-visible failures

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
DieCast: Testing Distributed Systems with an Accurate Scale Model

ACM Transactions on Computer Systems (TOCS)
Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs

Proceedings of the sixth conference on Computer systems
COSCA: an easy-to-use component-based PaaS cloud system for common applications

Proceedings of the First International Workshop on Cloud Computing Platforms
Using hierarchal change mining to manage network security policy evolution

Hot-ICE'11 Proceedings of the 11th USENIX conference on Hot topics in management of internet, cloud, and enterprise networks and services
Tesseract: a 4D network control plane

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
From Autonomic to Self-Self Behaviors: The JADE Experience

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
Ranking the importance of alerts for problem determination in large computer systems

Cluster Computing
An empirical study on configuration errors in commercial and open source systems

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Depot: Cloud Storage with Minimal Trust

ACM Transactions on Computer Systems (TOCS)
Job failures in high performance computing systems: A large-scale empirical study

Computers & Mathematics with Applications
Firewall policy change-impact analysis

ACM Transactions on Internet Technology (TOIT)
Integrated management of network and security devices in IT infrastructures

Proceedings of the 7th International Conference on Network and Services Management
Simplifying autonomic enterprise java bean applications via model-driven development: a case study

MoDELS'05 Proceedings of the 8th international conference on Model Driven Engineering Languages and Systems
Case-based reasoning for autonomous service failure diagnosis and remediation in software systems

ECCBR'06 Proceedings of the 8th European conference on Advances in Case-Based Reasoning
End-user perspectives of Internet connectivity problems

Computer Networks: The International Journal of Computer and Telecommunications Networking
Impact analysis of BGP sessions for prioritization of maintenance operations

Computer Networks: The International Journal of Computer and Telecommunications Networking
Diagnosis of software failures using computational geometry

ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering
SP 800-144. Guidelines on Security and Privacy in Public Cloud Computing

SP 800-144. Guidelines on Security and Privacy in Public Cloud Computing
RCourse: a robustness benchmarking suite for publish/subscribe overlay simulations with Peersim

Proceedings of the First Workshop on P2P and Dependability
Transparent VPN failure recovery with virtualization

Future Generation Computer Systems
Separating Performance Anomalies from Workload-Explained Failures in Streaming Servers

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
A tutorial on reliability in publish/subscribe services

Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems
Failure-aware resource provisioning for hybrid Cloud infrastructure

Journal of Parallel and Distributed Computing
Improving manageability through reorganization of routing-policy configurations

Computer Networks: The International Journal of Computer and Telecommunications Networking
Cache-Based Query Processing for Search Engines

ACM Transactions on the Web (TWEB)
Automatic undo for cloud management via AI planning

HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Change-impact analysis of firewall policies

ESORICS'07 Proceedings of the 12th European conference on Research in Computer Security
Network management game

ACM SIGCOMM Computer Communication Review
Internet on the move: challenges and solutions

ACM SIGCOMM Computer Communication Review
A declarative approach to automated configuration

LISA'12 Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques
XUTools: UNIX commands for processing next-generation structured text

LISA'12 Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques
Survey on reliability in publish/subscribe services

Computer Networks: The International Journal of Computer and Telecommunications Networking
Failure recovery: when the cure is worse than the disease

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Quantifying and verifying reachability for access controlled networks

IEEE/ACM Transactions on Networking (TON)
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Do not blame users for misconfigurations

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
An automated system for emulated network experimentation

Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
A survey of software aging and rejuvenation studies

ACM Journal on Emerging Technologies in Computing Systems (JETC) - Special Issue on Reliability and Device Degradation in Emerging Technologies and Special Issue on WoSAR 2011
Detecting cloud provisioning errors using an annotated process model

Proceedings of the 8th Workshop on Middleware for Next Generation Internet Computing
EnCore: exploiting system environment and correlation information for misconfiguration detection

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Supporting undoability in systems operations

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Workload-aware anomaly detection for Web applications

Journal of Systems and Software

Abstract

In 1986 Jim Gray published his landmark study of the causes of failures of Tandem systems and the techniques Tandem used to prevent such failures [6]. Seventeen years later, Internet services have replaced fault-tolerant servers as the new kid on the 24×7-availability block. Using data from three large-scale Internet services, we analyzed the causes of their failures and the (potential) effectiveness of various techniques for preventing and mitigating service failure. We find that (1) operator error is the largest cause of failures in two of the three services, (2) operator error is the largest contributor to time to repair in two of the three services, (3) configuration errors are the largest category of operator errors, (4) failures in custom-written front-end software are significant, and (5) more extensive online testing and more thoroughly exposing and detecting component failures would reduce failure rates in at least one service. Qualitatively, we find that improvement in the maintenance tools and systems used by service operations staff would decrease time to diagnose and repair problems.