Adaptive and reliable parallel computing on networks of workstations

  • Authors:
  • Robert D. Blumofe;Philip A. Lisiecki

  • Affiliations:
  • Department of Computer Sciences, The University of Texas at Austin, Austin, Texas;MIT Laboratory for Computer Science, Cambridge, Massachusetts

  • Venue:
  • ATEC '97 Proceedings of the annual conference on USENIX Annual Technical Conference
  • Year:
  • 1997

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we present the design of Cilk-NOW, a runtime system that adaptively and reliably executes functional Cilk programs in parallel on a network of UNIX workstations. Cilk (pronounced "silk") is a parallel multithreaded extension of the C language, and all Cilk runtime systems employ a provably efficient threadscheduling algorithm. Cilk-NOW is such a runtime system, and in addition, Cilk-NOW automatically delivers adaptive and reliable execution for a functional subset of Cilk programs. By adaptive execution, we mean that each Cilk program dynamically utilizes a changing set of otherwise-idle workstations. By reliable execution, we mean that the Cilk-NOW system as a whole and each executing Cilk program are able to tolerate machine and network faults. Cilk-NOW provides these features while programs remain fault oblivious, meaning that Cilk programmers need not code for fault tolerance. Throughout this paper, we focus on end-to-end design decisions, and we show how these decisions allow the design to exploit high-level algorithmic properties of the Cilk programming model in order to simplify and streamline the implementation.