Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Intel threading building blocks
Intel threading building blocks
Feedback-directed pipeline parallelism
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Shared Register File Based ILP for Multicore
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
BWS: balanced work stealing for time-sharing multicores
Proceedings of the 7th ACM european conference on Computer Systems
Server-based scheduling of parallel real-time tasks
Proceedings of the tenth ACM international conference on Embedded software
Adaptive workload-aware task scheduling for single-ISA asymmetric multicore architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Parallel programming is a requirement in the multi-core era. One of the most promising techniques to make parallel programming available for general users is the use of parallel programming patterns. Functional pipeline parallelism is a well suited pattern for many emerging applications, such as streaming and "Recognition, Mining and Synthesis" (RMS) workloads. In this paper we develop an analytical model for pipeline parallelism and use it to characterize and optimize two of the PARSEC benchmarks which use the parallel pipeline pattern, ferret and dedup. We identify two scalability limitations: load imbalance and I/O bottlenecks. We address load imbalance using two techniques: parallel pipeline stage collapsing and dynamic scheduling. We implemented these optimizations using Pthreads and the Threading Building Blocks (TBB) libraries. We compare predicted and measured performance of all these implementations on a large scale SMP machine and we note that the work-stealing TBB implementation outperforms all other variants.