We show how to extend classical work-stealing to deal with tightly coupled data parallel tasks that can require any number of threads r ≥ 1 for their execution, and term this extension work-stealing with deterministic team-building. As threads become idle, they either attempt to join a team of threads designated for a task requiring r > 1 threads for its execution, or attempt to steal a task; no central coordination is required. Team building and stealing are done according to a deterministic hierarchy and involve at most a logarithmic number of possibly randomized steal attempts. Threads attempting to join the team for a task requiring a large number of threads may help smaller teams while waiting for the large team to form. Once a team has been formed, the threads can execute the data parallel task in close coordination. The implementation can be done with standard lock-free data structures and takes only a single extra compare-and-swap (CAS) operation per thread to build a team. In the degenerate case where all tasks require only a single thread, the implementation coincides with a locality-aware work-stealing implementation. Using a prototype C++ implementation of our extended work-stealing algorithm, a mixed-mode parallel Quicksort algorithm with a data parallel partitioning step has been implemented. We compare our (improved) implementation of this algorithm on top of our extended work-stealing scheduler to a standard task-parallel implementation with this scheduler, and with Intel Cilk Plus and Threading Building Blocks. In addition, we also compare to the optimized parallel MCSTL Quicksort. Results are shown for a 32-core Intel Nehalem EX system and a 16-core Sun T2+ system supporting up to 128 hardware threads. The mixed-mode parallel algorithm performs consistently better than the fork-join implementation, often significantly.