Abstract

This paper reports an application-dependent network design for extreme-scale high performance computing (HPC) applications. Traditional scalable network designs focus on fast point-to-point transmission of generic data packets. The proposed network instead focuses on the sustainability of HPC applications through statistical multiplexing of semantic data objects. For HPC applications that use data-driven parallel processing, a tuple is a semantic object. We report the design and implementation of a tuple switching network for data-parallel HPC applications, with the goal of gaining performance and reliability at the same time as computing and communication resources are added. We describe a sustainability model and a simple computational experiment that demonstrate the sustainability of extreme-scale applications as the system mean time between failures (MTBF) decreases. Assuming a threefold slowdown from statistical multiplexing and a 35% time loss per checkpoint, a two-tier tuple switching framework would produce sustained performance and energy savings for extreme-scale HPC applications that use more than 1024 processors or run with an MTBF of less than 6 hours. Higher processor counts or higher checkpoint overheads accelerate the benefits.
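The pressure that a shrinking MTBF puts on checkpoint-based recovery can be illustrated with a short calculation. The sketch below is not the sustainability model of this paper, which the abstract does not spell out. It uses Young's well-known first-order approximation for the checkpoint interval, and it assumes independent node failures so that system MTBF is the node MTBF divided by the node count. The function names, the 0.5 hour checkpoint cost, and the 6144 hour node MTBF are all hypothetical values chosen for illustration.

```python
import math


def system_mtbf(node_mtbf_h: float, nodes: int) -> float:
    """System MTBF in hours, assuming independent node failures (an assumption)."""
    return node_mtbf_h / nodes


def young_interval(checkpoint_cost_h: float, mtbf_h: float) -> float:
    """Young's first-order estimate of the optimal checkpoint interval, in hours."""
    return math.sqrt(2.0 * checkpoint_cost_h * mtbf_h)


def waste_fraction(checkpoint_cost_h: float, mtbf_h: float) -> float:
    """Approximate fraction of run time lost to checkpoints and recomputation.

    First-order model: checkpoint overhead (cost / interval) plus the expected
    rework after a failure (interval / (2 * MTBF)). Restart time is ignored,
    and the result is capped at 1.0.
    """
    tau = young_interval(checkpoint_cost_h, mtbf_h)
    waste = checkpoint_cost_h / tau + tau / (2.0 * mtbf_h)
    return min(1.0, waste)


if __name__ == "__main__":
    node_mtbf_h = 6144.0       # hypothetical per-node MTBF
    checkpoint_cost_h = 0.5    # hypothetical time to write one checkpoint
    for nodes in (256, 1024, 4096, 16384):
        mtbf = system_mtbf(node_mtbf_h, nodes)
        lost = waste_fraction(checkpoint_cost_h, mtbf)
        print(f"nodes={nodes:6d}  system MTBF={mtbf:7.2f} h  time lost={lost:.1%}")
```

Under these assumptions the lost fraction grows as nodes are added, which is the regime in which the abstract argues that a slower but failure-tolerant tuple switching design may come out ahead.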

Tuple switching network - When slower may be better
