The uniform representation of 2-dimensional arrays serially in Morton order (or Z-order) supports both their iterative scan with Cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important because it eases serious problems of locality and latency, and the tree helps to schedule multiprocessing. Results here show how it facilitates algorithms that avoid cache misses and page faults at all levels of hierarchical memory, independently of a specific runtime environment.

We have built a rudimentary C-to-C translator that implements matrices in Morton order from source that presumes a row-major implementation. Early performance of LAPACK's reference implementation of dgesv (linear solver) and all its supporting routines (including dgemm matrix multiplication) forms a successful research demonstration. Its performance predicts improvements from new algebra in back-end optimizers.

We also present results from a more stylish dgemm algorithm that takes better advantage of this representation. With only routine back-end optimizations inserted by hand (unfolding the base case and passing arguments in registers), we achieve machine performance exceeding that of the manufacturer-crafted dgemm running at 67% of peak flops. The same code performs similarly on several machines.

Together, these results show how existing codes and future block-recursive algorithms can work well together on this matrix representation. Locality is key to future performance, and the new representation has a remarkable impact.
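
To make the dual view of a Morton-ordered matrix concrete, here is a minimal sketch in C. It is an illustration only, not the translator or the dgemm described in the abstract. The names `spread_bits`, `morton_index`, and `morton_mm` are hypothetical, and the sketch assumes a square matrix whose order is a power of two, indices that fit in 16 bits, and the convention that row bits occupy the odd bit positions. Under that convention the four quadrants of any block are contiguous in memory in the order NW, NE, SW, SE, so a block-recursive multiply needs no index arithmetic beyond an offset per quadrant.

```c
#include <stddef.h>
#include <stdint.h>

/* Spread the low 16 bits of x into the even bit positions of the result. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Cartesian view: map (row, col) to an offset in Morton order.
 * Assumed convention: row bits go to odd positions, column bits to even. */
static uint32_t morton_index(uint32_t row, uint32_t col)
{
    return (spread_bits(row) << 1) | spread_bits(col);
}

/* Quadtree view: C += A * B for n-by-n Morton-ordered matrices,
 * n a power of two. Quadrant (i, j) of a block starts at offset
 * (2*i + j) * (n/2)^2, so each recursive call gets contiguous operands. */
static void morton_mm(double *c, const double *a, const double *b, size_t n)
{
    if (n == 1) {               /* base case left unoptimized on purpose */
        c[0] += a[0] * b[0];
        return;
    }
    size_t h = n / 2;
    size_t q = h * h;           /* number of elements in one quadrant */
    for (size_t i = 0; i < 2; i++)
        for (size_t j = 0; j < 2; j++)
            for (size_t k = 0; k < 2; k++)
                morton_mm(c + (2 * i + j) * q,
                          a + (2 * i + k) * q,
                          b + (2 * k + j) * q,
                          h);
}
```

An element-wise loop would address entry (i, j) as `a[morton_index(i, j)]`, while the recursive routine never computes an index at all. The one-element base case keeps the sketch short; a practical version would stop at a larger block and unroll it, in the spirit of the hand-inserted optimizations mentioned above.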