Optimal static load balancing in distributed computer systems
Journal of the ACM (JACM)
An Algorithm for Optimal Static Load Balancing in Distributed Computer Systems
IEEE Transactions on Computers
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
Static scheduling algorithms for allocating directed task graphs to multiprocessors
ACM Computing Surveys (CSUR)
Efficient load balancing for wide-area divide-and-conquer applications
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Scalable Dynamic Load Balancing Using UPC
ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
UTS: an unbalanced tree search benchmark
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Dynamic distributed scheduling algorithm for state space search
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
A work-stealing scheduling framework supporting fault tolerance
Proceedings of the Conference on Design, Automation and Test in Europe
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Load balancing non-uniform parallel computations
Proceedings of the 2013 workshop on Programming based on actors, agents, and decentralized control
Hi-index | 0.00 |
Today a significant fraction of HPC clusters are built from multi-core machines connected via a high speed interconnect, hence, they have a mix of shared memory and distributed memory. Work stealing algorithms are currently designed for either a shared memory architecture or for a distributed memory architecture and are extended to work on these multi-core clusters by assuming a single underlying architecture. However, as the number of cores in each node increase, the differences between a shared memory architecture and a distributed memory architecture become more acute. Current work stealing approaches are not suitable for multi-core clusters due to the dichotomy of the underlying architecture. We combine the best aspects of both the current approaches in to a new algorithm. Our algorithm allows for more efficient execution of large-scale HPC applications, such as UTS, on clusters which have large multi-cores. As the number of cores per node increase, which is inevitable given today's processor trends, such an approach is crucial.