Mul-T: a high-performance parallel Lisp
PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
Compiling collection-oriented languages onto massively parallel computers
Journal of Parallel and Distributed Computing - Massively parallel computation
Lazy task creation: a technique for increasing the granularity of parallel programs
LFP '90 Proceedings of the 1990 ACM conference on LISP and functional programming
Implementation of a portable nested data-parallel language
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Lazy threads: implementing a fast parallel call
Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
Thread scheduling for multiprogrammed multiprocessors
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
StackThreads/MP: integrating futures into calling standards
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Thread fork/join techniques for multi-level parallelism exploitation in NUMA multiprocessors
ICS '99 Proceedings of the 13th international conference on Supercomputing
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
The data locality of work stealing
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Non-blocking steal-half work queues
Proceedings of the twenty-first annual symposium on Principles of distributed computing
Dynamic circular work-stealing deque
Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Automatic thread distribution for nested parallelism in OpenMP
Proceedings of the 19th annual international conference on Supercomputing
A Mesh-of-Trees Interconnection Network for Single-Chip Parallel Processing
ASAP '06 Proceedings of the IEEE 17th International Conference on Application-specific Systems, Architectures and Processors
Adaptive work stealing with parallelism feedback
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Fpga-based prototype of a pram-on-chip processor
Proceedings of the 5th conference on Computing frontiers
Proceedings of the conference on Design, automation and test in Europe
An adaptive cut-off for task parallelism
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
The Cilk++ concurrency platform
Proceedings of the 46th Annual Design Automation Conference
Proceedings of the 41st ACM technical symposium on Computer science education
Proceedings of the 15th ACM SIGPLAN international conference on Functional programming
Using simple abstraction to reinvent computing for parallelism
Communications of the ACM
Implicitly threaded parallelism in manticore
Journal of Functional Programming
Parcae: a system for flexible parallel execution
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Riposte: a trace-driven compiler and parallel VM for vector code in R
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Hi-index | 0.02 |
We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing (EBS) implemented in Intel's Threading Building Blocks (TBB), but improves performance and ease-of-programming. In its simplest form (SP), EBS requires manual tuning by repeatedly running the application under carefully controlled conditions to determine a stop-splitting-threshold (sst)for every do-all loop in the code. This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism. Besides being tedious, this tuning also over-fits the code to some particular dataset, platform and calling context of the do-all loop, resulting in poor performance portability for the code. LBS overcomes both the performance portability and ease-of-programming pitfalls of a manually fixed threshold by adapting dynamically to run-time conditions without requiring tuning. We compare LBS to Auto-Partitioner (AP), the latest default scheduler of TBB, which does not require manual tuning either but lacks context portability, and outperform it by 38.9% using TBB's default AP configuration, and by 16.2% after we tuned AP to our experimental platform. We also compare LBS to SP by manually finding SP's sst using a training dataset and then running both on a different execution dataset. LBS outperforms SP by 19.5% on average. while allowing for improved performance portability without requiring tedious manual tuning. LBS also outperforms SP with sst=1, its default value when undefined, by 56.7%, and serializing work-stealing (SWS), another work-stealer by 54.7%. Finally, compared to serializing inner parallelism (SI) which has been used by OpenMP, LBS is 54.2% faster.