Parallelization libraries: Characterizing and reducing overheads

Authors:
Abhishek Bhattacharjee;Gilberto Contreras;Margaret Martonosi
Affiliations:
Rutgers University;Nvidia Corporation;Princeton University
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2011


Abstract

Creating efficient, scalable dynamic parallel runtime systems for chip multiprocessors (CMPs) requires understanding the overheads that manifest at high core counts and small task sizes. In this article, we assess these overheads in Intel's Threading Building Blocks (TBB) and OpenMP. First, we use real hardware and simulations to detail various scheduler and synchronization overheads. We find that these can amount to 47% of TBB benchmark runtime and 80% of OpenMP benchmark runtime. Second, we propose load-balancing techniques, such as occupancy-based and criticality-guided task stealing, to boost performance. Overall, our study provides valuable insights for creating robust, scalable runtime libraries.
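
The abstract names occupancy-based task stealing without describing the policy. The following is a minimal sketch of one plausible reading, in which an idle worker steals from the peer whose queue currently holds the most tasks instead of a randomly chosen peer. It is an illustration under that assumption, not the article's implementation; the function names (`pick_victim_random`, `pick_victim_by_occupancy`, `steal`) and the list-of-deques model are hypothetical, and the sketch assumes at least two workers.

```python
from collections import deque
import random


def pick_victim_random(queues, thief, rng):
    """Baseline policy: choose any other worker uniformly at random."""
    candidates = [w for w in range(len(queues)) if w != thief]
    return rng.choice(candidates)


def pick_victim_by_occupancy(queues, thief):
    """Assumed occupancy-based policy: choose the other worker
    with the most queued tasks (first such worker on ties)."""
    candidates = [w for w in range(len(queues)) if w != thief]
    return max(candidates, key=lambda w: len(queues[w]))


def steal(queues, thief, victim):
    """Move the oldest task from the victim's queue to the thief's queue.
    Returns the stolen task, or None if the victim has nothing queued."""
    if not queues[victim]:
        return None
    task = queues[victim].popleft()
    queues[thief].append(task)
    return task


# Four workers; worker 0 is idle and looks for work.
queues = [deque(), deque(["a"]), deque(["b", "c", "d"]), deque()]
victim = pick_victim_by_occupancy(queues, thief=0)
stolen = steal(queues, 0, victim)
```

In this toy setup the occupancy-based choice always targets the most heavily loaded worker, whereas the random baseline may pick a worker with an empty queue and waste the steal attempt. A real runtime would also need synchronization around the queues, which this single-threaded sketch omits.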