Computing in the RAIN: A Reliable Array of Independent Nodes

Authors:
Vasken Bohossian;Chenggong Charles Fan;Paul S. LeMahieu;Marc D. Riedel;Lihao Xu;Jehoshua Bruck
Venue:
IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
Year:
2000

Abstract

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to RAINfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures; 2) fault management techniques based on group membership; and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: highly available video and web servers, and a distributed checkpointing system.