In the next five years, the number of processors in high-end systems for scientific computing is expected to rise to tens and even hundreds of thousands. For example, the IBM BlueGene/L can have up to 128,000 processors, and delivery of the first system is scheduled for 2005. Existing deficiencies in the scalability and fault tolerance of scientific applications need to be addressed soon. If the number of processors grows by an order of magnitude while efficiency drops by an order of magnitude, the overall effective computing performance stays the same. Furthermore, the mean time to interrupt of high-end computer systems decreases with scale and complexity. In a 100,000-processor system, failures may occur every couple of minutes, and traditional checkpointing may no longer be feasible. In this paper, we summarize our recent research on super-scalable algorithms for computing on 100,000 processors. We introduce the algorithm properties of scale invariance and natural fault tolerance, and discuss how they can be applied to two different classes of algorithms. We also describe a super-scalable diskless checkpointing algorithm for problems that cannot be transformed into a super-scalable variant, or for which other solutions are more efficient. Finally, a 100,000-processor simulator is presented as a platform for testing and experimentation.
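The efficiency argument can be checked with simple arithmetic. The sketch below assumes a deliberately simple model that is not taken from the paper: effective performance is the processor count multiplied by parallel efficiency, measured in units of one processor's peak rate. The function name and the sample efficiencies (50% and 5%) are hypothetical values chosen only for illustration.

```python
import math


def effective_performance(processors: int, efficiency: float) -> float:
    """Delivered performance in units of one processor's peak rate.

    Assumed model (illustrative only): processors * parallel efficiency.
    """
    if processors <= 0:
        raise ValueError("processors must be positive")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie in [0, 1]")
    return processors * efficiency


# Hypothetical numbers: ten times the processors at one tenth the efficiency.
small = effective_performance(10_000, 0.50)
large = effective_performance(100_000, 0.05)

print(small, large, math.isclose(small, large))
```

Under this model the two configurations deliver the same effective performance, so adding processors pays off only if efficiency is held roughly constant as the system grows.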