SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks

Authors:
Ernie Chan;Field G. Van Zee;Paolo Bientinesi;Enrique S. Quintana-Orti;Gregorio Quintana-Orti;Robert van de Geijn
Affiliations:
The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA;Duke University, Durham, NC, USA;Universidad Jaume I, Castellon, Spain;Universidad Jaume I, Castellon, Spain;The University of Texas at Austin, Austin, TX, USA
Venue:
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2008


Abstract

This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code written at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight is to view matrices hierarchically, as collections of blocks that serve as units of data, with operations over those blocks treated as units of computation. The implementation transparently enqueues the required operations, tracks the dependencies among them internally, and then executes them using out-of-order execution techniques inspired by superscalar microarchitectures. This separation of concerns lets library developers implement algorithms without concerning themselves with parallelization. Different scheduling heuristics can be implemented in the runtime system independently of the code that enqueues the operations. Results gathered on a 16-CPU ccNUMA Itanium2 server demonstrate excellent performance.
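The enqueue-then-execute idea in the abstract can be illustrated with a small sketch. This is not the SuperMatrix API: the names `Task`, `Runtime`, `enqueue`, `execute`, and `cholesky_by_blocks` are hypothetical, and the choice of a blocked Cholesky factorization as the driving algorithm is an assumption made for illustration. The sketch records symbolic operations on blocks, derives dependencies from which blocks each operation reads and writes, and then groups tasks into "waves" of mutually independent operations. A real runtime would presumably dispatch each task to a thread as soon as its dependencies complete rather than waiting for a whole wave.

```python
class Task:
    """One operation on blocks (hypothetical structure, for illustration)."""
    def __init__(self, name, reads, writes):
        self.name = name
        self.writes = set(writes)
        # Updated blocks are read-modify-write, so count them as reads too.
        self.reads = set(reads) | self.writes
        self.deps = set()  # indices of earlier tasks that must finish first


class Runtime:
    """Toy runtime: records tasks and dependencies, then orders them."""
    def __init__(self):
        self.tasks = []

    def enqueue(self, name, reads, writes):
        task = Task(name, reads, writes)
        for i, prev in enumerate(self.tasks):
            flow_or_output = prev.writes & task.reads   # read/write after write
            anti = prev.reads & task.writes             # write after read
            if flow_or_output or anti:
                task.deps.add(i)
        self.tasks.append(task)

    def execute(self):
        """Return waves of task names; tasks within a wave are independent."""
        done, waves = set(), []
        while len(done) < len(self.tasks):
            ready = [i for i, t in enumerate(self.tasks)
                     if i not in done and t.deps <= done]
            waves.append([self.tasks[i].name for i in ready])
            done.update(ready)
        return waves


def cholesky_by_blocks(rt, n):
    """Enqueue a right-looking Cholesky on an n x n grid of blocks.

    The loop body is sequential-looking library code; it only enqueues
    symbolic operations and never mentions threads or scheduling.
    """
    for k in range(n):
        rt.enqueue(f"CHOL({k},{k})", reads=[], writes=[(k, k)])
        for i in range(k + 1, n):
            rt.enqueue(f"TRSM({i},{k})", reads=[(k, k)], writes=[(i, k)])
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                if i == j:
                    rt.enqueue(f"SYRK({i},{i})", reads=[(i, k)],
                               writes=[(i, i)])
                else:
                    rt.enqueue(f"GEMM({i},{j})", reads=[(i, k), (j, k)],
                               writes=[(i, j)])


if __name__ == "__main__":
    rt = Runtime()
    cholesky_by_blocks(rt, 3)
    for step, wave in enumerate(rt.execute()):
        print(step, wave)
```

In this sketch the algorithm code and the ordering policy are separate: `cholesky_by_blocks` could be left unchanged while `execute` is replaced by a different heuristic, which mirrors the separation of concerns the abstract describes.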