Lifeline-based global load balancing

Authors:
Vijay A. Saraswat;Prabhanjan Kambadur;Sreedhar Kodali;David Grove;Sriram Krishnamoorthy
Affiliations:
IBM TJ Watson Research Centre, Hawthorne, NY, USA;IBM TJ Watson Research Centre, Yorktown, USA;IBM Systems and Technology Group, Bangalore, India;IBM TJ Watson Research Centre, Hawthorne, USA;Pacific Northwest National Laboratory, Richland, USA
Venue:
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Year:
2011

Citing 21
Cited 11

Derivation of a termination detection algorithm for distributed computations. Control Flow and Data Flow: concepts of distributed programming.
CHARM++: a portable concurrent object oriented system based on C++. OOPSLA '93 Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications.
Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing.
The implementation of the Cilk-5 multithreaded language. PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation.
Efficient load balancing for wide-area divide-and-conquer applications. PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming.
ATLAS: an infrastructure for global computing. EW 7 Proceedings of the 7th workshop on ACM SIGOPS European workshop: Systems support for worldwide applications.
Starting with termination: a methodology for building distributed garbage collection algorithms. ACSC '01 Proceedings of the 24th Australasian conference on Computer science.
State of the Art in Parallel Search Techniques for Discrete Optimization Problems. IEEE Transactions on Knowledge and Data Engineering.
Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. IEEE Transactions on Parallel and Distributed Systems.
The Natural Work-Stealing Algorithm is Stable. SIAM Journal on Computing.
MapReduce: simplified data processing on large clusters. OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6.
Adaptive and reliable parallel computing on networks of workstations. ATEC '97 Proceedings of the annual conference on USENIX Annual Technical Conference.
Achieving Distributed Termination without Freezing. IEEE Transactions on Software Engineering.
Scheduling multithreaded computations by work stealing. SFCS '94 Proceedings of the 35th Annual Symposium on Foundations of Computer Science.
Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing. ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing.
Scalable Dynamic Load Balancing Using UPC. ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing.
Intel threading building blocks.
Work-first and help-first scheduling policies for async-finish task parallelism. IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing.
PFunc: modern task parallelism for modern high performance computing. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
Scalable work stealing. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
UTS: an unbalanced tree search benchmark. LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing.

Evaluating the performance and scalability of mapreduce applications on X10. APPT'11 Proceedings of the 9th international conference on Advanced parallel processing technologies.
A work-stealing scheduler for X10's task parallelism with suspension. Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming.
Work stealing and persistence-based load balancers for iterative overdecomposed applications. Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing.
Data-driven fault tolerance for work stealing computations. Proceedings of the 26th ACM international conference on Supercomputing.
Dynamic distributed scheduling algorithm for state space search. Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing.
Using load information in work-stealing on distributed systems with non-uniform communication latencies. Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing.
Hybrid parallel task placement in X10. Proceedings of the third ACM SIGPLAN X10 Workshop.
A work-stealing scheduling framework supporting fault tolerance. Proceedings of the Conference on Design, Automation and Test in Europe.
X10 and APGAS at Petascale. Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming.
Resilient X10: efficient failure-aware programming. Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming.
GLB: lifeline-based global load balancing library in x10. Proceedings of the first workshop on Parallel programming for analytics applications.

Abstract

On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS). There are two main difficulties in extending this approach to distributed memory. First, in the shared-memory approach, thieves (nodes without work) repeatedly attempt to steal work asynchronously from randomly chosen victims until they find some. In distributed memory, a thief cannot autonomously steal work from a victim without disrupting the victim's execution. When work is sparse, this degrades performance. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves waste CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (that is, to detect that programs at all nodes are looking for work, and hence that no work remains). This problem is well studied and requires careful design for good performance. Unfortunately, in most existing languages and frameworks, application developers are forced to implement their own distributed termination detection.
In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the nodes on its outgoing edges in the lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline, and it in turn shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced.
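The lifeline protocol described above can be illustrated with a small sequential toy model. This is only a sketch under stated assumptions, not the X10 implementation: the names `hypercube_lifelines`, `simulate`, and `give` are hypothetical, the node count is assumed to be a power of two so that a plain binary hypercube can serve as the lifeline graph, all work is assumed to start on node 0, a steal or lifeline push is assumed to move half of the victim's work, and nodes take turns in round-robin order rather than running concurrently.

```python
import random


def hypercube_lifelines(k):
    """Outgoing lifelines of each node in a k-dimensional binary hypercube.

    Node i is linked to the k nodes whose index differs from i in one bit,
    so the degree is k and the diameter is k for 2**k nodes.
    """
    n = 1 << k
    return {node: [node ^ (1 << d) for d in range(k)] for node in range(n)}


def simulate(k=3, initial_work=200, w=2, seed=0):
    """Round-robin toy model: returns (units processed, failed steals)."""
    rng = random.Random(seed)
    out = hypercube_lifelines(k)
    n = len(out)
    work = [0] * n
    work[0] = initial_work                  # assumption: all work starts on node 0
    quiesced = [False] * n
    waiting = {v: set() for v in range(n)}  # waiting[v]: nodes that armed a lifeline on v
    processed = failed_steals = 0

    def give(victim, thief):
        """Move half of the victim's work (rounded down) to the thief."""
        share = work[victim] // 2
        work[victim] -= share
        work[thief] += share
        return share

    while not all(quiesced):
        for node in range(n):
            if work[node] > 0:
                # Serve armed incoming lifelines before doing local work.
                for thief in sorted(waiting[node]):
                    if give(node, thief) > 0:
                        waiting[node].discard(thief)
                        quiesced[thief] = False   # work arrival reactivates the thief
                work[node] -= 1                   # process one unit of work
                processed += 1
            elif not quiesced[node]:
                # Up to w random steal attempts, then fall back to lifelines.
                for _ in range(w):
                    victim = rng.choice([v for v in range(n) if v != node])
                    if give(victim, node) > 0:
                        break
                    failed_steals += 1
                else:
                    for buddy in out[node]:
                        waiting[buddy].add(node)  # arm every outgoing lifeline
                    quiesced[node] = True         # stop disturbing other nodes
    return processed, failed_steals


processed, failed_steals = simulate()
print(processed, failed_steals)
```

In this model a quiesced node never holds work, so the loop exits exactly when every node has quiesced and all work has been processed, which mirrors the termination condition stated in the abstract. For simplicity, lifelines armed on other buddies are left in place after a node is reactivated.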
In a language such as X10, such passive distributed termination can be detected automatically using the finish construct; no application code is necessary. Our design is implemented in a few hundred lines of X10. On the binomial tree described in [olivier:08], the program achieves 87% efficiency on an InfiniBand cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/s. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, with a peak throughput of 0.966 GNodes/s. All numbers are relative to single-core sequential performance. This implementation has been refactored into a reusable global load balancing framework. Applications can use this framework to obtain global load balance with minimal code changes.
In summary, we claim: (a) the first formulation of UTS that does not involve application-level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals, (c) the demonstration of simple lifeline graphs based on k-hypercubes, and (d) performance with efficiency superior to published results on UTS (or the same efficiency over a wider range). In particular, our framework can deliver performance equal to or better than that of an unrestricted random work-stealing implementation, while reducing the number of attempted steals.
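To make the reported figures concrete, the short sketch below relates efficiency, core count, and throughput. It assumes that efficiency is defined as parallel throughput divided by (cores × single-core sequential throughput), and that the peak throughput and the 87% efficiency were measured at the same configuration; the abstract does not state either point explicitly. The function name `implied_sequential_rate` is hypothetical.

```python
def implied_sequential_rate(peak_nodes_per_sec, cores, efficiency):
    """Single-core rate implied by efficiency = peak / (cores * sequential rate)."""
    return peak_nodes_per_sec / (cores * efficiency)


# Figures taken from the abstract; the derived rates are only as good as
# the assumptions stated in the lead-in.
power7 = implied_sequential_rate(2.37e9, 1024, 0.87)
blue_gene_p = implied_sequential_rate(0.966e9, 2048, 0.87)

print(f"Power7 core:      {power7 / 1e6:.2f} MNodes/s")
print(f"Blue Gene/P core: {blue_gene_p / 1e6:.2f} MNodes/s")
```

Under these assumptions the implied sequential rates are on the order of a few million tree nodes per second per Power7 core and several hundred thousand per Blue Gene/P processor.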