Parallelizing dense matrix computations on distributed-memory architectures is a well-studied subject, generally considered among the best-understood domains of parallel computing. Two packages developed in the mid-1990s still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may well take the shape of distributed-memory architectures within a single processor, these packages must be revisited, since traditional MPI-based approaches will likely need to be extended. This is therefore a good time to review the lessons learned since these two packages were introduced and to propose a simple yet effective alternative. Preliminary performance results show that the new solution achieves competitive, if not superior, performance on large clusters.
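To make the distribution question concrete, the following is an illustrative sketch (not taken from the paper) of the two-dimensional block-cyclic layout that ScaLAPACK uses to map a global matrix onto a logical process grid; the `owner` helper and its parameter names are hypothetical, chosen only to show how global indices map to grid coordinates.

```python
# Illustrative sketch: 2D block-cyclic distribution, as used by ScaLAPACK.
# Blocks of size nb x nb are dealt out cyclically over a pr x pc process grid.
def owner(i, j, nb, pr, pc):
    """Return the (row, col) grid coordinates of the process that owns
    global matrix element (i, j), for block size nb on a pr x pc grid."""
    return ((i // nb) % pr, (j // nb) % pc)

# Example: an 8x8 matrix with 2x2 blocks on a 2x2 process grid.
# Each element's owner alternates in both dimensions every nb rows/columns.
layout = [[owner(i, j, 2, 2, 2) for j in range(8)] for i in range(8)]
```

Choosing such a layout balances load across the grid for algorithms that sweep through the matrix (e.g., LU and QR factorization), which is one reason block-cyclic distributions became the standard design point that packages like the one proposed here must either adopt or argue against.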