NUMA policies and their relation to memory architecture
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Provably efficient scheduling for languages with fine-grained parallelism
Journal of the ACM (JACM)
The data locality of work stealing
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Tuning Strassen's matrix multiplication for memory efficiency
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Exploiting memory affinity in OpenMP through schedule reuse
ACM SIGARCH Computer Architecture News - Special Issue: PACT 2001 workshops
X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Hardware profile-guided automatic page placement for ccNUMA systems
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Scheduling threads for constructive cache sharing on CMPs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Parallel Programmability and the Chapel Language
International Journal of High Performance Computing Applications
Scheduling multithreaded computations by work stealing
SFCS '94 Proceedings of the 35th Annual Symposium on Foundations of Computer Science
Intel threading building blocks
Intel threading building blocks
Effective performance measurement and analysis of multithreaded applications
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
IEEE Transactions on Parallel and Distributed Systems
Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
The design of a task parallel library
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
SLAW: a scalable locality-aware adaptive work-stealing scheduler for multi-core systems
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
HPCTOOLKIT: tools for performance analysis of optimized parallel programs http://hpctoolkit.org
Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing
hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications
PDP '10 Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing
Jedule: A Tool for Visualizing Schedules of Parallel Applications
ICPPW '10 Proceedings of the 2010 39th International Conference on Parallel Processing Workshops
Enabling locality-aware computations in OpenMP
Scientific Programming - Exploring Languages for Expressing Medium to Massive On-Chip Parallelism
Scheduling task parallelism on multi-socket multicore systems
Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
Towards NUMA support with distance information
IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
Hierarchical place trees: a portable abstraction for task parallelism and data movement
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
A ROSE-Based OpenMP 3.0 research compiler supporting multiple runtime libraries
IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more
Binding nested OpenMP programs on hierarchical memory architectures
IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more
Assessing OpenMP tasking implementations on NUMA architectures
IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Design and implementation of a customizable work stealing scheduler
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
Hi-index | 0.00 |
Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation -- additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.