The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology.
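To give a concrete feel for the third contribution, the idea of storing data across nodes with an XOR-based error-control code can be sketched as follows. This is a minimal illustration only, not the coding scheme used in RAIN: it uses a single parity block, which survives the loss of one node, whereas the abstract describes tolerance of multiple failures. The function names `xor_blocks`, `encode`, and `recover` are hypothetical and chosen for this example.

```python
def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode(data_blocks):
    """Append one parity block (assumption: one block is stored per node)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stored, lost_index):
    """Rebuild the block at lost_index from all surviving blocks."""
    survivors = [b for i, b in enumerate(stored) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    data = [b"rain", b"node", b"data"]
    stored = encode(data)  # three data blocks plus one parity block
    # Simulate the failure of each node in turn and rebuild its block.
    for lost in range(len(stored)):
        assert recover(stored, lost) == stored[lost]
```

Encoding and recovery here use only XOR operations, which is what makes codes of this family computationally cheap. Tolerating two or more simultaneous node failures requires a stronger code than this single-parity sketch.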