Using memory mapping to support cactus stacks in work-stealing runtime systems

Authors:
I-Ting Angelina Lee;Silas Boyd-Wickizer;Zhiyi Huang;Charles E. Leiserson
Affiliations:
MIT CSAIL, Cambridge, MA, USA;MIT CSAIL, Cambridge, MA, USA;University of Otago, Dunedin, New Zealand;MIT CSAIL, Cambridge, MA, USA
Venue:
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Year:
2010

Abstract

Many multithreaded concurrency platforms that use a work-stealing runtime system incorporate a "cactus stack," wherein a function's accesses to stack variables properly respect the function's calling ancestry, even when many of the functions operate in parallel. Unfortunately, such existing concurrency platforms fail to satisfy at least one of the following three desirable criteria: full interoperability with legacy or third-party serial binaries that have been compiled to use an ordinary linear stack, a scheduler that provides near-perfect linear speedup on applications with sufficient parallelism, and bounded and efficient use of memory for the cactus stack. We have addressed this cactus-stack problem by modifying the Linux operating system kernel to provide support for thread-local memory mapping (TLMM). We have used TLMM to reimplement the cactus stack in the open-source Cilk-5 runtime system. The resulting Cilk-M runtime system removes the linguistic distinction imposed by Cilk-5 between serial code and parallel code, erases Cilk-5's limitation that serial code cannot call parallel code, and provides full compatibility with existing serial calling conventions. The Cilk-M runtime system provides strong guarantees on scheduler performance and stack space. Benchmark results indicate that the performance of the prototype Cilk-M 1.0 is comparable to the Cilk 5.4.6 system, and the consumption of stack space is modest.
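
The cactus-stack property mentioned in the abstract can be illustrated with a minimal sketch. This is not the Cilk-M implementation, which relies on TLMM support in the kernel; it only models the data structure as a tree of frames with parent links. The names `Frame`, `spawn`, `lookup`, and `ancestry` are hypothetical, and looking variables up by name stands in for the pointer-based accesses a real C program would make.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Frame:
    """One activation frame; `parent` points at the caller's frame."""
    name: str
    parent: Optional["Frame"] = None
    local_vars: Dict[str, int] = field(default_factory=dict)

    def spawn(self, name: str, **local_vars: int) -> "Frame":
        # Children of the same frame share all of its ancestors,
        # which gives the structure its branching "cactus" shape.
        return Frame(name, parent=self, local_vars=dict(local_vars))

    def ancestry(self) -> List[str]:
        # Names of the frames on the path from this frame up to the root.
        names = []
        frame: Optional[Frame] = self
        while frame is not None:
            names.append(frame.name)
            frame = frame.parent
        return names

    def lookup(self, var: str) -> int:
        # Search this frame and its ancestors only; sibling frames
        # (which may be running in parallel) are never visible.
        frame: Optional[Frame] = self
        while frame is not None:
            if var in frame.local_vars:
                return frame.local_vars[var]
            frame = frame.parent
        raise KeyError(var)


# Illustrative call tree: A spawns B and C; C calls D.
a = Frame("A", local_vars={"x": 1})
b = a.spawn("B", y=2)
c = a.spawn("C", y=3)
d = c.spawn("D", z=4)
```

In this sketch `b` and `c` both see `x` from `A`, but each sees only its own `y`, and `b` cannot reach `z` in `D` because `D` is not on its ancestry path.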