Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms

Authors:
Michael D. Adams; David S. Wise
Affiliations:
Indiana University, Bloomington, IN; Indiana University, Bloomington, IN
Venue:
Proceedings of the 2006 workshop on Memory system performance and correctness
Year:
2006

Abstract

A blossoming paradigm for block-recursive matrix algorithms is presented that, at once, attains excellent performance measured by
• time
• TLB misses
• L1 misses
• L2 misses
• paging to disk
• scaling on distributed processors, and
• portability to multiple platforms.

It provides a philosophy and tools that allow the programmer to deal with the memory hierarchy invisibly, from L1 and L2 to TLB, paging, and interprocessor communication. Used together, they provide a cache-oblivious style of programming.

Plots are presented to support these claims on an implementation of Cholesky factorization crafted directly from the paradigm in C with a few intrinsic calls. The results in this paper focus on low-level performance, including the new Morton-hybrid representation to take advantage of hardware and compiler optimizations. In particular, this code beats Intel's Math Kernel Library and matches AMD's Core Math Library, losing a bit on L1 misses while winning decisively on TLB misses.
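
As an illustration of the Morton-hybrid representation named in the abstract, the following C sketch computes the storage offset of element (row, col) when fixed-size square blocks are laid out in Morton (Z) order and the elements inside each block are row-major. It is a minimal sketch under stated assumptions, not the authors' code: the function names, the choice of row bits in odd positions and column bits in even positions, the row-major base blocks, and the 16-bit limit on indices are all assumptions made here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of x so that bit k moves to bit 2k
   (hypothetical helper; the odd bits of the result are zero). */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Pure Morton index: row bits land in odd positions, column bits in
   even positions (an assumed convention). Both indices must fit in
   16 bits so the result fits in 32 bits. */
static uint32_t morton_index(uint32_t row, uint32_t col)
{
    assert(row <= 0xFFFFu && col <= 0xFFFFu);
    return (spread_bits(row) << 1) | spread_bits(col);
}

/* Morton-hybrid offset: blocks of side 2^log2_block are ordered by
   Morton index; elements inside a block are row-major (assumed). */
static uint32_t morton_hybrid_index(uint32_t row, uint32_t col,
                                    uint32_t log2_block)
{
    assert(log2_block < 16);
    assert(row <= 0xFFFFu && col <= 0xFFFFu);
    uint32_t mask  = (1u << log2_block) - 1u;
    uint32_t block = morton_index(row >> log2_block, col >> log2_block);
    return (block << (2u * log2_block))      /* start of the block     */
         | ((row & mask) << log2_block)      /* row within the block   */
         | (col & mask);                     /* column within the row  */
}
```

With 4 x 4 base blocks (log2_block = 2), element (5, 6) lies in block (1, 1), which has Morton index 3, at local position (1, 2), so its offset is 3*16 + 1*4 + 2 = 54. Keeping each base block row-major is what would let a conventional compiler and the hardware treat the innermost loops as ordinary dense kernels, while the Morton ordering of blocks keeps every recursive quadrant contiguous in memory.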