Characterizing and mitigating work time inflation in task parallel programs

Authors:
Stephen L. Olivier;Bronis R. de Supinski;Martin Schulz;Jan F. Prins
Affiliations:
Univ. of North Carolina at Chapel Hill, Chapel Hill, NC;Lawrence Livermore National Laboratory, Livermore, CA;Lawrence Livermore National Laboratory, Livermore, CA;Univ. of North Carolina at Chapel Hill, Chapel Hill, NC
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Citing 28
Cited 1

NUMA policies and their relation to memory architecture

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The implementation of the Cilk-5 multithreaded language

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Provably efficient scheduling for languages with fine-grained parallelism

Journal of the ACM (JACM)
The data locality of work stealing

Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Tuning Strassen's matrix multiplication for memory efficiency

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Exploiting memory affinity in OpenMP through schedule reuse

ACM SIGARCH Computer Architecture News - Special Issue: PACT 2001 workshops
X10: an object-oriented approach to non-uniform cluster computing

OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Hardware profile-guided automatic page placement for ccNUMA systems

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Scheduling threads for constructive cache sharing on CMPs

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Parallel Programmability and the Chapel Language

International Journal of High Performance Computing Applications
Scheduling multithreaded computations by work stealing

SFCS '94 Proceedings of the 35th Annual Symposium on Foundations of Computer Science
Intel threading building blocks

Intel threading building blocks
Effective performance measurement and analysis of multithreaded applications

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
The Design of OpenMP Tasks

IEEE Transactions on Parallel and Distributed Systems
Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
The design of a task parallel library

Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP

ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
SLAW: a scalable locality-aware adaptive work-stealing scheduler for multi-core systems

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
HPCTOOLKIT: tools for performance analysis of optimized parallel programs http://hpctoolkit.org

Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing
hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications

PDP '10 Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing
Jedule: A Tool for Visualizing Schedules of Parallel Applications

ICPPW '10 Proceedings of the 2010 39th International Conference on Parallel Processing Workshops
Enabling locality-aware computations in OpenMP

Scientific Programming - Exploring Languages for Expressing Medium to Massive On-Chip Parallelism
Scheduling task parallelism on multi-socket multicore systems

Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
Towards NUMA support with distance information

IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
Hierarchical place trees: a portable abstraction for task parallelism and data movement

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
A ROSE-Based OpenMP 3.0 research compiler supporting multiple runtime libraries

IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more
Binding nested OpenMP programs on hierarchical memory architectures

IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more
Assessing OpenMP tasking implementations on NUMA architectures

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World

Design and implementation of a customizable work stealing scheduler

Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers

Quantified Score

Hi-index	0.00

Visualization

Abstract

Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation -- additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.