Recursive Array Layouts and Fast Matrix Multiplication

Authors:
Siddhartha Chatterjee;Alvin R. Lebeck;Praveen K. Patnala;Mithuna Thottethodi
Affiliations:
-;-;-;-
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2002

Citing 43
Cited 18

Memory storage patterns in parallel processing

Memory storage patterns in parallel processing
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Evaluating Associativity in CPU Caches

IEEE Transactions on Computers
Data optimization: allocation of arrays to reduce communication on SIMD machines

Journal of Parallel and Distributed Computing - Massively parallel computation
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Linear clustering of objects with multiple attributes

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The high performance Fortran handbook

The high performance Fortran handbook
Automatic data partitioning on distributed memory multicomputers

Automatic data partitioning on distributed memory multicomputers
A parallel hashed Oct-Tree N-body algorithm

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
GEMMW: a portable level 3 BLAS Winograd variant of Strassen's matrix-matrix multiply algorithm

Journal of Computational Physics
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Optimal evaluation of array expressions on massively parallel machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Unifying data and control transformations for distributed shared-memory machines

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Cilk: an efficient multithreaded runtime system

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Balancing processor loads and exploiting data locality in N-body simulations

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves

IEEE Transactions on Parallel and Distributed Systems
Strategic directions in storage I/O issues in large-scale computing

ACM Computing Surveys (CSUR) - Special ACM 50th-anniversary issue: strategic directions in computing research
Analysis of the clustering properties of Hilbert space-filling curve

Analysis of the clustering properties of Hilbert space-filling curve
Efficient out-of-core algorithms for linear relaxation using blocking covers

Journal of Computer and System Sciences - Special issue: papers from the 32nd and 34th annual symposia on foundations of computer science, Oct. 2–4, 1991 and Nov. 3–5, 1993
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
High performance Fortran for highly irregular problems

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Locality of Reference in LU Decomposition with Partial Pivoting

SIAM Journal on Matrix Analysis and Applications
Recursion leads to automatic variable blocking for dense linear-algebra algorithms

IBM Journal of Research and Development
Automatic data layout for distributed-memory machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Towards a theory of cache-efficient algorithms

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Implementation of Strassen's algorithm for matrix multiplication

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Language support for Morton-order matrices

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Tuning Strassen's matrix multiplication for memory efficiency

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Accuracy and Stability of Numerical Algorithms

Accuracy and Stability of Numerical Algorithms
Parallel Computer Architecture: A Hardware/Software Approach

Parallel Computer Architecture: A Hardware/Software Approach
Digital Design

Digital Design
Hierarchical tiling for improved superscalar performance

IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Recursive Formulation of Cholesky Algorithm in Fortran 90

PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
New Generalized Data Structures for Matrices Lead to a Variety of High Performance Algorithms

PPAM '01 Proceedings of the th International Conference on Parallel Processing and Applied Mathematics-Revised Papers
Efficient Procedures for Using Matrix Algorithms

Proceedings of the 2nd Colloquium on Automata, Languages and Programming
Applying recursion to serial and parallel QR factorization leads to better performance

IBM Journal of Research and Development

QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Estimating influence of data layout optimizations on SDRAM energy consumption

Proceedings of the 2003 international symposium on Low power electronics and design
The Opie compiler from row-major source to Morton-ordered matrices

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Fast additions on masked integers

ACM SIGPLAN Notices
Analyzing block locality in Morton-order and Morton-hybrid matrices

MEDEA '06 Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures
Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms

Proceedings of the 2006 workshop on Memory system performance and correctness
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Representation-transparent matrix algorithms with scalable performance

Proceedings of the 21st annual international conference on Supercomputing
Analyzing block locality in Morton-order and Morton-hybrid matrices

ACM SIGARCH Computer Architecture News
SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Adaptive Winograd's matrix multiplications

ACM Transactions on Mathematical Software (TOMS)
An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization

High Performance Computing for Computational Science - VECPAR 2008
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Evaluating ISA support and hardware support for recursive data layouts

HiPC'07 Proceedings of the 14th international conference on High performance computing
An approach to locality-conscious load balancing and transparent memory hierarchy management with a global- address-space parallel programming model

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Optimizing matrix multiplication with a classifier learning system

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
A data locality methodology for matrix---matrix multiplication algorithm

The Journal of Supercomputing
A paradigm for parallel matrix algorithms: scalable cholesky

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between 10 percent and 20 percent. Carrying the recursive layout down to the level of individual matrix elements is shown to be counterproductive; a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated and it is shown that addressing overheads can be kept in control even for the most computationally demanding of these layouts.