This paper reports an application-dependent network design for extreme-scale high performance computing (HPC) applications. Traditional scalable network designs focus on fast point-to-point transmission of generic data packets. The proposed network instead focuses on the sustainability of HPC applications through statistical multiplexing of semantic data objects. For HPC applications that use data-driven parallel processing, a tuple is a semantic object. We report the design and implementation of a tuple switching network for data-parallel HPC applications, with the goal of gaining performance and reliability at the same time as computing and communication resources are added. We describe a sustainability model and a simple computational experiment that demonstrate the sustainability of extreme-scale applications as the system mean time between failures (MTBF) decreases. Assuming a threefold slowdown from statistical multiplexing and a 35% time loss per checkpoint, a two-tier tuple switching framework would produce sustained performance and energy savings for extreme-scale HPC applications that use more than 1024 processors or run with an MTBF of less than 6 hours. Higher processor counts or higher checkpoint overheads accelerate the benefits.