Adaptive and reliable parallel computing on networks of workstations

Authors:
Robert D. Blumofe;Philip A. Lisiecki
Affiliations:
Department of Computer Sciences, The University of Texas at Austin, Austin, Texas;MIT Laboratory for Computer Science, Cambridge, Massachusetts
Venue:
ATEC '97 Proceedings of the annual conference on USENIX Annual Technical Conference
Year:
1997

Abstract

In this paper, we present the design of Cilk-NOW, a runtime system that adaptively and reliably executes functional Cilk programs in parallel on a network of UNIX workstations. Cilk (pronounced "silk") is a parallel multithreaded extension of the C language, and all Cilk runtime systems employ a provably efficient thread-scheduling algorithm. Cilk-NOW is such a runtime system, and in addition, Cilk-NOW automatically delivers adaptive and reliable execution for a functional subset of Cilk programs. By adaptive execution, we mean that each Cilk program dynamically utilizes a changing set of otherwise-idle workstations. By reliable execution, we mean that the Cilk-NOW system as a whole and each executing Cilk program are able to tolerate machine and network faults. Cilk-NOW provides these features while programs remain fault oblivious, meaning that Cilk programmers need not code for fault tolerance. Throughout this paper, we focus on end-to-end design decisions, and we show how these decisions allow the design to exploit high-level algorithmic properties of the Cilk programming model in order to simplify and streamline the implementation.
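
The abstract does not say how Cilk-NOW achieves reliability, so the following is only an illustrative sketch of one general idea that a functional (side-effect-free) programming model makes possible: a lost subcomputation can simply be run again, and the program code itself never mentions failures. It is written in Python, not Cilk, and it is not a description of Cilk-NOW's mechanism. All names here (`fib_task`, `make_runner`, `crash_probability`) are hypothetical, and the "crash" is a random simulation, not a real machine or network fault.

```python
import random


def fib_task(n, run_child):
    """A pure divide-and-conquer task: its result depends only on n.

    The task is fault oblivious. It asks the runtime (run_child) to
    evaluate its children and contains no recovery logic of its own.
    """
    if n < 2:
        return n
    return run_child(n - 1) + run_child(n - 2)


def make_runner(crash_probability, rng, stats):
    """Build a toy 'runtime' that re-executes lost subcomputations.

    crash_probability: chance that a finished attempt is lost with its
                       (simulated) workstation. Must be less than 1.
    rng:               a random.Random instance, for reproducibility.
    stats:             dict with 'attempts' and 'retries' counters.
    """
    def run_child(n):
        while True:
            stats["attempts"] += 1
            result = fib_task(n, run_child)
            if rng.random() < crash_probability:
                # Simulated fault: the result is lost, so redo the work.
                # This is safe only because fib_task has no side effects.
                stats["retries"] += 1
                continue
            return result

    return run_child


if __name__ == "__main__":
    stats = {"attempts": 0, "retries": 0}
    run = make_runner(0.2, random.Random(1), stats)
    print("fib(12) =", run(12))
    print("attempts:", stats["attempts"], "retries:", stats["retries"])
```

Because every task is a pure function of its argument, the final answer is the same however many attempts are discarded; only the amount of work changes. A real system would need to bound the re-executed work (for example by saving completed results), which this sketch deliberately leaves out.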