Super-Scalable algorithms for computing on 100,000 processors

Authors:
Christian Engelmann;Al Geist
Affiliations:
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN;Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN
Venue:
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part I
Year:
2005



Abstract

In the next five years, the number of processors in high-end systems for scientific computing is expected to rise to tens and even hundreds of thousands. For example, the IBM BlueGene/L can have up to 128,000 processors, and delivery of the first system is scheduled for 2005. Existing deficiencies in the scalability and fault tolerance of scientific applications need to be addressed soon. If the number of processors grows by an order of magnitude while efficiency drops by an order of magnitude, the overall effective computing performance stays the same. Furthermore, the mean time to interrupt of high-end computer systems decreases with scale and complexity. In a 100,000-processor system, failures may occur every couple of minutes, and traditional checkpointing may no longer be feasible. In this paper, we summarize our recent research in super-scalable algorithms for computing on 100,000 processors. We introduce the algorithm properties of scale invariance and natural fault tolerance, and discuss how they can be applied to two different classes of algorithms. We also describe a super-scalable diskless checkpointing algorithm for problems that cannot be transformed into a super-scalable variant, or for which other solutions are more efficient. Finally, a 100,000-processor simulator is presented as a platform for testing and experimentation.
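
The two scaling arguments in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes that effective performance is simply processor count times parallel efficiency, and that processors fail independently at a constant rate, so that the system mean time to interrupt is roughly the per-processor mean time to failure divided by the processor count. The function names and the one-year per-processor figure are hypothetical, chosen only for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def effective_performance(processors: int, efficiency: float) -> float:
    """Effective performance in 'processor equivalents' (assumed model:
    processor count multiplied by parallel efficiency)."""
    return processors * efficiency


def system_mtti_minutes(processors: int, node_mttf_years: float) -> float:
    """Approximate system mean time to interrupt in minutes, assuming
    independent processor failures at a constant rate (an assumption,
    not a measured property of any real machine)."""
    return node_mttf_years * MINUTES_PER_YEAR / processors


if __name__ == "__main__":
    # Ten times the processors at one tenth the efficiency: no net gain.
    small = effective_performance(10_000, 0.5)
    large = effective_performance(100_000, 0.05)
    print(f"10,000 processors at 50% efficiency:  {small:.0f}")
    print(f"100,000 processors at 5% efficiency:  {large:.0f}")

    # Hypothetical per-processor mean time to failure of one year.
    mtti = system_mtti_minutes(100_000, node_mttf_years=1.0)
    print(f"System MTTI at 100,000 processors: {mtti:.1f} minutes")
```

Under these assumptions, a one-year per-processor mean time to failure gives a system-wide interrupt roughly every five minutes at 100,000 processors, which is the regime in which the abstract argues that traditional checkpointing may stop being feasible.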