Implementing Reliable Data Structures for MPI Services in High Component Count Systems

Authors:
Justin M. Wozniak;Bryan Jacobs;Robert Latham;Sam Lang;Seung Woo Son;Robert Ross
Affiliations:
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA 60439;Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA 60439;Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA 60439;Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA 60439;Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA 60439;Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA 60439
Venue:
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Year:
2009

Citing 7

Kademlia: A Peer-to-Peer Information System Based on the XOR Metric
IPTPS '01 Revised Papers from the First International Workshop on Peer-to-Peer Systems

Assessing Fault Sensitivity in MPI Applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing

Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs
Proceedings of the 2004 ACM/IEEE conference on Supercomputing

Fault tolerant high performance computing by a coding approach
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming

Fault Tolerance in Message Passing Interface Programs
International Journal of High Performance Computing Applications

Can MPI be used for persistent parallel services?
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface

Open issues in MPI implementation
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture

Cited 1

JETS: Language and System Support for Many-Parallel-Task Workflows
Journal of Grid Computing

Abstract

High performance computing systems continue to grow: currently deployed systems exceed 160,000 cores and systems exceeding 1,000,000 cores are planned. Without significant improvements in component reliability, partial system failure modes could become an unacceptably regular occurrence, limiting the usability of advanced computing infrastructures. In this work, we intend to ease the development of survivable systems and applications through the implementation of a reliable key/value data store based on a distributed hash table (DHT). Borrowing from techniques developed for unreliable wide-area systems, we implemented a distributed data service built with MPI [1] that enables user data structures to survive partial system failure. The service is based on a new implementation of the Kademlia [2] distributed hash table.
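
To make the idea of a key/value store that survives partial failure more concrete, the following is a minimal single-process sketch of how a Kademlia-style XOR metric can place replicas of a key on the nodes closest to it. It is not the authors' implementation and does not use MPI: the names `key_id`, `closest_nodes`, and `ToyStore`, the use of SHA-1 for deriving identifiers, and the replica count of 3 are all assumptions made for illustration.

```python
import hashlib


def key_id(key: str) -> int:
    """Map a string to a 160-bit integer ID (SHA-1 is an assumed choice)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance between two IDs: their bitwise XOR as an integer."""
    return a ^ b


def closest_nodes(node_ids, target: int, k: int):
    """Return the k node IDs with the smallest XOR distance to target."""
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:k]


class ToyStore:
    """Hypothetical in-memory stand-in for a replicated DHT (no networking)."""

    def __init__(self, node_ids, replicas: int = 3):
        self.replicas = replicas
        # Each live node owns a private dictionary of key/value pairs.
        self.nodes = {n: {} for n in node_ids}

    def put(self, key: str, value) -> None:
        # Write the pair to the `replicas` live nodes closest to the key's ID.
        for n in closest_nodes(self.nodes, key_id(key), self.replicas):
            self.nodes[n][key] = value

    def get(self, key: str):
        # Ask the closest live nodes in order; return the first copy found.
        for n in closest_nodes(self.nodes, key_id(key), self.replicas):
            if key in self.nodes[n]:
                return self.nodes[n][key]
        return None

    def fail(self, node_id: int) -> None:
        # Simulate a partial system failure: the node and its data disappear.
        del self.nodes[node_id]


if __name__ == "__main__":
    ids = [key_id(f"rank-{r}") for r in range(8)]  # hypothetical node names
    store = ToyStore(ids, replicas=3)
    store.put("checkpoint", {"step": 10})
    holders = closest_nodes(ids, key_id("checkpoint"), 3)
    store.fail(holders[0])  # lose the closest replica
    print(store.get("checkpoint"))
```

In this sketch a value remains retrievable as long as at least one of its replica holders is still alive, because the surviving holders are still among the closest live nodes to the key. A real service would also need to re-replicate data after a failure and to route lookups between processes, neither of which is shown here.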