Parallel depth first search. Part II. analysis
International Journal of Parallel Programming
Scalable load balancing techniques for parallel computers
Journal of Parallel and Distributed Computing
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Analytic Comparison of Two Advanced C Language-Based Parallel Programming Models
ISPDC '04 Proceedings of the Third International Symposium on Parallel and Distributed Computing/Third International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks
Graphs over time: densification laws, shrinking diameters and possible explanations
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Communication Optimizations for Fine-Grained UPC Applications
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Hardware profile-guided automatic page placement for ccNUMA systems
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Scheduling multithreaded computations by work stealing
SFCS '94 Proceedings of the 35th Annual Symposium on Foundations of Computer Science
Evaluating OpenMP 3.0 Run Time Systems on Unbalanced Task Graphs
IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Analyzing lock contention in multithreaded applications
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Lifeline-based global load balancing
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Unbalanced tree search on a manycore system using the GPI programming model
Computer Science - Research and Development
Work stealing for multi-core HPC clusters
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Shared work list: hacking amorphous data parallelism in UPC
Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Critical lock analysis: diagnosing critical section bottlenecks in multithreaded applications
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Work-stealing with configurable scheduling strategies
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
A synthetic task model for HPC-grade optical network performance evaluation
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
Load balancing non-uniform parallel computations
Proceedings of the 2013 workshop on Programming based on actors, agents, and decentralized control
Heterogeneous-race-free memory models
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Friendly barriers: efficient work-stealing with return barriers
Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Hi-index | 0.00 |
This paper presents an unbalanced tree search (UTS) benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing. We describe algorithms for building a variety of unbalanced search trees to simulate different forms of load imbalance. We created versions of UTS in two parallel languages, OpenMP and Unified Parallel C (UPC), using work stealing as the mechanism for reducing load imbalance. We benchmarked the performance of UTS on various parallel architectures, including shared-memory systems and PC clusters. We found it simple to implement UTS in both UPC and OpenMP, due to UPC's shared-memory abstractions. Results show that both UPC and OpenMP can support efficient dynamic load balancing on shared-memory architectures. However, UPC cannot alleviate the underlying communication costs of distributed-memory systems. Since dynamic load balancing requires intensive communication, performance portability remains difficult for applications such as UTS and performance degrades on PC clusters. By varying key work stealing parameters, we expose important tradeoffs between the granularity of load balance, the degree of parallelism, and communication costs.