Inspector/executor load balancing algorithms for block-sparse tensor contractions

Authors:
David Ozog;Sameer Shende;Allen Malony;Jeff R. Hammond;James Dinan;Pavan Balaji
Affiliations:
University of Oregon, Eugene, Oregon, USA;University of Oregon, Eugene, Oregon, USA;University of Oregon, Eugene, Oregon, USA;Argonne National Laboratory, Lemont, Illinois, USA;Argonne National Laboratory, Lemont, Illinois, USA;Argonne National Laboratory, Lemont, Illinois, USA
Venue:
Proceedings of the 27th international ACM conference on International conference on supercomputing
Year:
2013

Citing 15
Cited 1

Global arrays: a nonuniform memory access programming model for high-performance computers

The Journal of Supercomputing
Efficient run-time support for irregular block-structured applications

Journal of Parallel and Distributed Computing - Special issue on irregular problems in supercomputing applications
Global arrays: a portable "shared-memory" programming model for distributed memory computers

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Zoltan Data Management Service for Parallel Dynamic Applications

Computing in Science and Engineering
An Integrated Runtime and Compile-Time Approach for Parallelizing Structured and Block Structured Applications

IEEE Transactions on Parallel and Distributed Systems
Run-Time Support for Multi-tier Programming of Block-Structured Applications on SMP Clusters

ISCOPE '97 Proceedings of the Scientific Computing in Object-Oriented Parallel Environments
Algorithm Design

Algorithm Design
The Tau Parallel Performance System

International Journal of High Performance Computing Applications
Hypergraph partitioning for automatic memory hierarchy management

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
On the Mapping Problem

IEEE Transactions on Computers
Scalable work stealing

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Performance characterization of global address space applications: a case study with NWChem

Concurrency and Computation: Practice & Experience
Load Balancing of Dynamical Nucleation Theory Monte Carlo Simulations through Resource Sharing Barriers

IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium
Heuristic static load-balancing algorithm applied to the fragment molecular orbital method

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Performance Modeling for Dense Linear Algebra

SCC '12 Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis

A framework for load balancing of tensor contraction expressions via dynamic task partitioning

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

Developing effective yet scalable load-balancing methods for irregular computations is critical to the successful application of simulations in a variety of disciplines at petascale and beyond. This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for different degrees of sparsity (and therefore load imbalance). In this particular application, a relatively large amount of task information can be obtained at minimal cost, which enables the use of static partitioning techniques that take the entire task list as input. However, fully static partitioning is incapable of dealing with dynamic variation of task costs, such as from transient network contention or operating system noise, so we also consider hybrid schemes that utilize dynamic scheduling within subgroups. These two schemes, which have not been previously implemented in NWChem or its proxies (i.e. quantum chemistry mini-apps) are compared to the original centralized dynamic load-balancing algorithm as well as improved centralized scheme. In all cases, we separate the scheduling of tasks from the execution of tasks into an inspector phase and an executor phase. The impact of these methods upon the application is substantial on a large InfiniBand cluster: execution time is reduced by as much as 50% at scale. The technique is applicable to any scientific application requiring load balance where performance models or estimations of kernel execution times are available.