A Proposal for a Heterogeneous Cluster ScaLAPACK (Dense Linear Solvers)

Authors:
Olivier Beaumont; Vincent Boudet; Antoine Petitet
Affiliations:
UMR CNRS-ENS Lyon-INRIA, Lyon, France (all authors)
Venue:
IEEE Transactions on Computers
Year:
2001

Abstract

In this paper, we study the implementation of dense linear algebra kernels, such as matrix multiplication or linear system solvers, on heterogeneous networks of workstations. The uniform block-cyclic data distribution scheme commonly used for homogeneous collections of processors limits the performance of these kernels on heterogeneous grids to the speed of the slowest processor. We present and study more sophisticated data allocation strategies that balance the load on heterogeneous platforms according to the relative speeds of the processors. When targeting unidimensional grids, the load-balancing problem can be solved rather easily. When targeting two-dimensional grids, which are the key to scalability and efficiency for numerical kernels, the problem turns out to be surprisingly difficult. We formally state the 2D load-balancing problem and prove its NP-completeness. Next, we introduce a data allocation heuristic that performs very satisfactorily in practice, as demonstrated by MPI experiments conducted on a heterogeneous network of workstations.
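The following is a minimal Python sketch that illustrates the abstract's claims under stated assumptions; it is not the authors' code, and all names (`t`, `uniform_1d`, `greedy_1d`, `best_2d`) are hypothetical. In 1D, processor i is assumed to need t_i time units per block. Uniform dealing ignores t_i, so the slowest processor sets the finish time. Giving each next block to the processor that would finish it earliest, i.e. argmin_i (c_i + 1) t_i, minimizes max_i c_i t_i. For 2D, the sketch assumes one common formalization for a single fixed p x q arrangement: processor (i, j) with cycle time t_ij owns an r_i x c_j rectangle, and the goal is to maximize the work done per time unit, (Σ r_i)(Σ c_j), subject to r_i c_j t_ij ≤ 1. The brute force assumes that some optimum has its tight constraints forming a spanning tree of the rows-versus-columns bipartite graph, so it tries every (p + q - 1)-edge subset. This is a swapped-in exhaustive search, not the paper's heuristic, and it only scales to tiny grids.

```python
import heapq
from itertools import combinations
from typing import List, Optional, Tuple


def makespan(counts: List[int], t: List[float]) -> float:
    """Finish time of a 1D allocation: the largest c_i * t_i."""
    return max(c * ti for c, ti in zip(counts, t))


def uniform_1d(n: int, t: List[float]) -> List[int]:
    """Block-cyclic dealing that ignores speeds: floor(n/p) or ceil(n/p) blocks each."""
    p = len(t)
    return [n // p + (1 if i < n % p else 0) for i in range(p)]


def greedy_1d(n: int, t: List[float]) -> List[int]:
    """Give each next block to the processor that would finish it first,
    i.e. argmin_i (c_i + 1) * t_i; this incremental rule minimizes max_i c_i * t_i."""
    counts = [0] * len(t)
    heap = [(ti, i) for i, ti in enumerate(t)]  # (finish time with one more block, proc)
    heapq.heapify(heap)
    for _ in range(n):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, ((counts[i] + 1) * t[i], i))
    return counts


def best_2d(t: List[List[float]]) -> Tuple[float, List[float], List[float]]:
    """Exhaustive search for ONE fixed p x q arrangement (hypothetical model):
    processor (i, j) owns an r_i x c_j rectangle; maximize (sum r) * (sum c)
    subject to r_i * c_j * t[i][j] <= 1. Only practical for tiny grids."""
    p, q = len(t), len(t[0])
    edges = [(i, j) for i in range(p) for j in range(q)]
    best: Tuple[float, List[float], List[float]] = (0.0, [], [])
    for tree in combinations(edges, p + q - 1):
        r: List[Optional[float]] = [1.0] + [None] * (p - 1)  # fix the scale: r_0 = 1
        c: List[Optional[float]] = [None] * q
        changed = True
        while changed:  # make every chosen edge tight: r_i * c_j * t_ij = 1
            changed = False
            for i, j in tree:
                if r[i] is not None and c[j] is None:
                    c[j] = 1.0 / (r[i] * t[i][j])
                    changed = True
                elif c[j] is not None and r[i] is None:
                    r[i] = 1.0 / (c[j] * t[i][j])
                    changed = True
        if None in r or None in c:
            continue  # edges do not connect all rows and columns: not a spanning tree
        feasible = all(r[i] * c[j] * t[i][j] <= 1.0 + 1e-9 for i, j in edges)
        value = sum(r) * sum(c)
        if feasible and value > best[0]:
            best = (value, list(r), list(c))
    return best


if __name__ == "__main__":
    t1 = [1.0, 2.0, 4.0]  # hypothetical per-block cycle times
    for alloc in (uniform_1d(7, t1), greedy_1d(7, t1)):
        print(alloc, makespan(alloc, t1))  # [3, 2, 2] 8.0, then [4, 2, 1] 4.0
    print(best_2d([[1.0, 2.0], [2.0, 4.0]])[0])  # 2.25 (rank-one cycle times)
    print(best_2d([[1.0, 1.0], [1.0, 2.0]])[0])  # 3.0
```

With these made-up cycle times, the speed-aware 1D allocation halves the uniform finish time (4.0 vs 8.0). In 2D, the rank-one matrix [[1, 2], [2, 4]] can be balanced perfectly: 2.25 equals the summed speeds Σ 1/t_ij. By contrast, [[1, 1], [1, 2]] reaches only 3.0 out of a possible 3.5, so some processor must sit idle part of the time.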