HPC-Colony: services and interfaces for very large systems

Authors:
Sayantan Chakravorty;Celso L. Mendes;Laxmikant V. Kalé;Terry Jones;Andrew Tauferner;Todd Inglett;José Moreira
Affiliations:
University of Illinois;University of Illinois;University of Illinois;Lawrence Livermore National Lab.;IBM;IBM;IBM
Venue:
ACM SIGOPS Operating Systems Review
Year:
2006

Abstract

Traditional full-featured operating systems are known to have properties that limit the scalability of distributed-memory parallel programs, the most common programming paradigm in high-end computing. Furthermore, as processor counts grow on the most capable systems, the activity needed to manage the system becomes an increasing burden. To make a general-purpose operating system scale to such levels, new technology is required for parallel resource management and global system management (including fault management). In this paper, we describe the shortcomings of full-featured operating systems and runtime systems, and discuss an approach for scaling such systems to one hundred thousand processors with both scalable parallel application performance and efficient system management.