Traditional full-featured operating systems are known to have properties that limit the scalability of distributed-memory parallel programs, the most common programming paradigm in high-end computing. Furthermore, as processor counts increase on the most capable systems, the activity required to manage the system becomes more of a burden. To make a general-purpose operating system scale to such levels, new technology is required for parallel resource management and global system management (including fault management). In this paper, we describe the shortcomings of full-featured operating systems and runtime systems, and discuss an approach for scaling such systems to one hundred thousand processors with both scalable parallel application performance and efficient system management.