Computing in the RAIN: A Reliable Array of Independent Nodes

Authors:
Vasken Bohossian;Chenggong C. Fan;Paul S. LeMahieu;Marc D. Riedel;Jehoshua Bruck;Lihao Xu
Affiliations:
Rainfinity, Pasadena, CA;California Institute of Technology, Pasadena;California Institute of Technology, Pasadena;California Institute of Technology, Pasadena;California Institute of Technology, Pasadena;Washington Univ., St. Louis, MO
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2001


Abstract

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology.
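
As a rough illustration of contribution 3, the sketch below shows the simplest kind of XOR-based erasure code: data blocks are spread over nodes together with one parity block, so the contents of any single failed node can be rebuilt from the survivors. This is not the coding scheme of the paper, which the abstract does not specify. The function names (`xor_blocks`, `encode`, `recover`) and the single-parity layout are assumptions chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the data blocks plus one XOR parity block (hypothetical layout:
    one block per storage node)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stored, lost_index):
    """Rebuild the block at lost_index by XOR-ing every surviving block."""
    survivors = [block for i, block in enumerate(stored) if i != lost_index]
    return xor_blocks(survivors)


# Three data nodes and one parity node; any single node may fail.
stored = encode([b"RAIN", b"node", b"data"])
assert recover(stored, 1) == b"node"
```

Encoding and recovery here use only XOR, which is what makes codes of this family cheap to compute. Tolerating more than one simultaneous node failure requires a stronger code than this single-parity sketch.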