Language support for Morton-order matrices

Authors:
David S. Wise;Jeremy D. Frens;Yuhong Gu;Gregory A. Alexander
Affiliations:
Computer Science Dept., Indiana University, Bloomington, IN;Dept. of Computer Science, Calvin College, Grand Rapids, MI and Indiana University;Oracle Corporation, One Oracle Drive, Nashua, NH and Indiana University;Computer Science Dept., Indiana University, Bloomington, IN
Venue:
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Year:
2001

Citing 27
Cited 16

An extended set of FORTRAN basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
The design and analysis of spatial data structures

The design and analysis of spatial data structures
Finding neighbors of equal size in linear quadtrees and octrees in constant time

CVGIP: Image Understanding
A parallel hashed Oct-Tree N-body algorithm

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Organizing arrays for paged memory systems

Communications of the ACM
Experiences of parallelising finite-element problems in a functional style

Software—Practice & Experience
LogP: a practical model of parallel computation

Communications of the ACM
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
The art of computer programming, volume 1 (3rd ed.): fundamental algorithms

The art of computer programming, volume 1 (3rd ed.): fundamental algorithms
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Recursion leads to automatic variable blocking for dense linear-algebra algorithms

IBM Journal of Research and Development
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
A Transformation System for Developing Recursive Programs

Journal of the ACM (JACM)
Transforming loops to recursion for multi-level memory hierarchies

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
An effective way to represent quadtrees

Communications of the ACM
Storage reorganization techniques for matrix computation in a paging environment

Communications of the ACM
Organizing matrices and matrix operations for paged memory systems

Communications of the ACM
Power aware page allocation

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
A Retargetable C Compiler: Design and Implementation

A Retargetable C Compiler: Design and Implementation
A class of data structures for associative searching

PODS '84 Proceedings of the 3rd ACM SIGACT-SIGMOD symposium on Principles of database systems
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
The history of FORTRAN I, II, and III

ACM SIGPLAN Notices - Special issue: History of programming languages conference
The Memory Bandwidth Bottleneck and its Amelioration by a Compiler

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Applying recursion to serial and parallel QR factorization leads to better performance

IBM Journal of Research and Development

Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
Is Morton Layout Competitive for Large Two-Dimensional Arrays?

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic tiling of iterative stencil loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
The Opie compiler from row-major source to Morton-ordered matrices

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Statistical Models for Empirical Search-Based Performance Tuning

International Journal of High Performance Computing Applications
Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms

Proceedings of the 2006 workshop on Memory system performance and correctness
Representation-transparent matrix algorithms with scalable performance

Proceedings of the 21st annual international conference on Supercomputing
Fast indexing for blocked array layouts to reduce cache misses

International Journal of High Performance Computing and Networking
Solving dense linear systems on platforms with multiple hardware accelerators

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Programming matrix algorithms-by-blocks for thread-level parallelism

ACM Transactions on Mathematical Software (TOMS)
Evaluating ISA support and hardware support for recursive data layouts

HiPC'07 Proceedings of the 14th international conference on High performance computing
User-defined distributions and layouts in chapel: philosophy and framework

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Minimizing associativity conflicts in morton layout

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
A paradigm for parallel matrix algorithms: scalable cholesky

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Dual-addressing memory architecture for two-dimensional memory access patterns

Proceedings of the Conference on Design, Automation and Test in Europe

Quantified Score

Hi-index	0.00

Visualization

Abstract

The uniform representation of 2-dimensional arrays serially in Morton order (or {\eee} order) supports both their iterative scan with cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important because it relaxes serious problems of locality and latency, and the tree helps to schedule multi-processing. Results here show how it facilitates algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment.We have built a rudimentary C-to-C translator that implements matrices in Morton-order from source that presumes a row-major implementation. Early performance from LAPACK's reference implementation of \texttt{dgesv} (linear solver), and all its supporting routines (including \texttt{dgemm} matrix-multiplication) form a successful research demonstration. Its performance predicts improvements from new algebra in back-end optimizers.We also present results from a more stylish \texttt{dgemm} algorithm that takes better advantage of this representation. With only routine back-end optimizations inserted by hand (unfolding the base case and passing arguments in registers), we achieve machine performance exceeding that of the manufacturer-crafted {\tt dgemm} running at 67% of peak flops. And the same code performs similarly on several machines.Together, these results show how existing codes and future block-recursive algorithms can work well together on this matrix representation. Locality is key to future performance, and the new representation has a remarkable impact.