In this paper, we study the implementation of dense linear algebra kernels, such as matrix multiplication or linear system solvers, on heterogeneous networks of workstations. The uniform block-cyclic data distribution scheme commonly used for homogeneous collections of processors limits the performance of these linear algebra kernels on heterogeneous grids to the speed of the slowest processor. We present and study more sophisticated data allocation strategies that balance the load on heterogeneous platforms with respect to the performance of the processors. When targeting unidimensional grids, the load-balancing problem can be solved rather easily. When targeting two-dimensional grids, which are the key to scalability and efficiency for numerical kernels, the problem turns out to be surprisingly difficult. We formally state the 2D load-balancing problem and prove its NP-completeness. Next, we introduce a data allocation heuristic, which turns out to be very satisfactory: Its practical usefulness is demonstrated by MPI experiments conducted with a heterogeneous network of workstations.
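To make the unidimensional case concrete, here is a minimal sketch of one way to balance the load, assuming each processor is described by a cycle-time (time to process one block) and that the goal is to minimize the largest per-processor workload. The function names `allocate_blocks` and `makespan` and the example cycle-times are hypothetical, chosen for illustration; this is not necessarily the allocation procedure used in the paper.

```python
def allocate_blocks(cycle_times, num_blocks):
    """Greedily assign blocks one at a time (illustrative sketch).

    cycle_times[i] is the assumed time processor i needs per block.
    Each block goes to the processor that would finish soonest
    after receiving it, i.e. the one minimizing (count + 1) * cycle_time.
    """
    counts = [0] * len(cycle_times)
    for _ in range(num_blocks):
        best = min(
            range(len(cycle_times)),
            key=lambda i: (counts[i] + 1) * cycle_times[i],
        )
        counts[best] += 1
    return counts


def makespan(cycle_times, counts):
    """Time at which the most heavily loaded processor finishes."""
    return max(c * t for c, t in zip(counts, cycle_times))


if __name__ == "__main__":
    times = [1, 2, 4]  # hypothetical: processor 0 is four times faster than processor 2
    balanced = allocate_blocks(times, 7)
    uniform = [3, 2, 2]  # an even split that ignores processor speed
    print("balanced:", balanced, "makespan:", makespan(times, balanced))
    print("uniform: ", uniform, "makespan:", makespan(times, uniform))
```

With these assumed cycle-times the speed-aware allocation finishes in half the time of the even split, which illustrates why a uniform distribution is bound by the slowest processor. The two-dimensional problem described in the abstract does not decompose this way, since row and column shares interact.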