Recent developments in computer architecture are moving toward systems with large core counts (manycore), which expose more parallelism to applications. Irregular and unbalanced applications require dynamic, asynchronous load balancing to exploit the full performance of a manycore system. The recently established Graph500 benchmark, for example, targets such applications. The UTS benchmark characterizes the performance of irregular and unbalanced computations using a tree-structured search space that requires continuous dynamic load balancing. GPI is a PGAS API that delivers the full performance of RDMA-enabled networks directly to the application. Its programming model focuses on one-sided asynchronous communication, which allows computation and communication to overlap. In this paper we address the dynamic load balancing requirements of unbalanced applications using the GPI programming model. Using the UTS benchmark, we detail the implementation of a work stealing algorithm in GPI and present performance results. Our evaluation shows significant improvements over the optimized MPI version, with a maximum performance of 9.5 billion nodes per second on 3072 cores.
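To make the idea of work stealing on an unbalanced tree concrete, the following is a minimal single-process sketch. It is not the GPI implementation described in the paper and uses no GPI or MPI calls. The tree rule (`children`, `MAX_DEPTH`, a hash-derived child count), the round-robin worker simulation, and the steal-half policy are all assumptions chosen for illustration. Each simulated worker processes nodes from the tail of its own queue and, when idle, takes half of the oldest entries from another worker's queue.

```python
from collections import deque
import hashlib

MAX_DEPTH = 10  # hypothetical depth limit so the illustrative tree stays small


def children(node: bytes, depth: int) -> list:
    """Return the child nodes of `node` (assumed rule, not the UTS definition).

    The child count is derived from the node's own hash, so subtree sizes
    vary unpredictably and the tree is unbalanced.
    """
    if depth >= MAX_DEPTH:
        return []
    n = 4 if depth == 0 else node[0] % 4  # root gets 4 children, others 0..3
    return [hashlib.sha1(node + bytes([i])).digest() for i in range(n)]


def count_sequential(root: bytes) -> int:
    """Reference traversal: count every node with a single stack."""
    stack = [(root, 0)]
    total = 0
    while stack:
        node, depth = stack.pop()
        total += 1
        stack.extend((c, depth + 1) for c in children(node, depth))
    return total


def count_work_stealing(root: bytes, num_workers: int = 4):
    """Simulate work stealing; return (nodes processed per worker, steal count)."""
    queues = [deque() for _ in range(num_workers)]
    queues[0].append((root, 0))  # all work starts on worker 0
    processed = [0] * num_workers
    steals = 0
    while any(queues):  # stop once every queue is empty
        for w in range(num_workers):
            q = queues[w]
            if q:
                # Owner works depth-first on its newest entry.
                node, depth = q.pop()
                processed[w] += 1
                q.extend((c, depth + 1) for c in children(node, depth))
            else:
                # Idle worker scans the others and steals half of the
                # oldest entries from the first queue with enough work.
                for offset in range(1, num_workers):
                    victim = queues[(w + offset) % num_workers]
                    if len(victim) >= 2:
                        for _ in range(len(victim) // 2):
                            q.append(victim.popleft())
                        steals += 1
                        break
    return processed, steals


ROOT = hashlib.sha1(b"root").digest()

if __name__ == "__main__":
    per_worker, steal_count = count_work_stealing(ROOT)
    print("nodes per worker:", per_worker)
    print("successful steals:", steal_count)
    print("total nodes:", sum(per_worker))
```

Stealing from the opposite end of the victim's queue tends to move nodes closer to the root, which usually carry larger subtrees, so fewer steals are needed. In a distributed setting the steal step would be a remote operation, which is where one-sided communication becomes relevant.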