Commercial Fault Tolerance: A Tale of Two Systems

Authors:
Wendy Bartlett; Lisa Spainhower
Affiliations:
IEEE Computer Society; IEEE
Venue:
IEEE Transactions on Dependable and Secure Computing
Year:
2004


Abstract

This paper compares and contrasts the design philosophies and implementations of two computer system families: the IBM S/360 and its evolution to the current zSeries line, and the Tandem (now HP) NonStop® Server. Both systems have a long history; the initial IBM S/360 machines were shipped in 1964, and the Tandem NonStop System was first shipped in 1976. Both were aimed at similar markets, running what would today be called enterprise-class applications. The requirement for the original S/360 line was for very high availability; the requirement for the NonStop platform was for single fault tolerance against unplanned outages. Since their initial shipments, availability expectations for both platforms have continued to rise, and the system designers and developers have been challenged to keep up. There were and still are many similarities in the design philosophies of the two lines, including the use of redundant components and extensive error checking. The primary difference is that the S/360-zSeries focus has been on localized retry and restore to keep processors functioning as long as possible, while the NonStop developers have based systems on a loosely coupled multiprocessor design that supports a "fail-fast" philosophy implemented through a combination of hardware and software, with workload being actively taken over by another resource when one fails.