An extended set of FORTRAN basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Exploiting fast matrix multiplication within the level 3 BLAS
ACM Transactions on Mathematical Software (TOMS)
A gentle introduction to Haskell
ACM SIGPLAN Notices - Haskell special issue
Stability of block algorithms with fast level-3 BLAS
ACM Transactions on Mathematical Software (TOMS)
Retire Fortran?: a debate rekindled
Communications of the ACM
Unifying data and control transformations for distributed shared-memory machines
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
A model and compilation strategy for out-of-core data parallel programs
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Symbolic mathematics system evaluators (extended abstract)
ISSAC '96 Proceedings of the 1996 international symposium on Symbolic and algebraic computation
LogP: a practical model of parallel computation
Communications of the ACM
Matrix computations (3rd ed.)
The art of computer programming, volume 1 (3rd ed.): fundamental algorithms
The art of computer programming, volume 1 (3rd ed.): fundamental algorithms
Compiler blockability of dense matrix factorizations
ACM Transactions on Mathematical Software (TOMS)
Storage reorganization techniques for matrix computation in a paging environment
Communications of the ACM
Organizing matrices and matrix operations for paged memory systems
Communications of the ACM
ACM '77 Proceedings of the 1977 annual conference
Representing matrices as quadtrees for parallel processors: extended abstract
ACM SIGSAM Bulletin
Compiler blockability of dense matrix factorizations
ACM Transactions on Mathematical Software (TOMS)
Automatic parallelization of divide and conquer algorithms
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Symbolic bounds analysis of pointers, array indices, and accessed memory regions
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Exact analysis of the cache behavior of nested loops
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Language support for Morton-order matrices
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Efficient Representation Scheme for Multidimensional Array Operations
IEEE Transactions on Computers
Pthreads for dynamic and irregular parallelism
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Tuning Strassen's matrix multiplication for memory efficiency
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Global static indexing for real-time exploration of very large regular grids
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Increasing temporal locality with skewing and recursive blocking
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
International Journal of Parallel Programming
Recursive Array Layouts and Fast Matrix Multiplication
IEEE Transactions on Parallel and Distributed Systems
Optimizing Graph Algorithms for Improved Cache Performance
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Recursion Unrolling for Divide and Conquer Programs
LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance
WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
CC '01 Proceedings of the 10th International Conference on Compiler Construction
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework
IEEE Transactions on Parallel and Distributed Systems
QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
IEEE Transactions on Parallel and Distributed Systems
Comparing Parallel Functional Languages: Programming and Performance
Higher-Order and Symbolic Computation
ACSC '04 Proceedings of the 27th Australasian conference on Computer science - Volume 26
Optimizing Graph Algorithms for Improved Cache Performance
IEEE Transactions on Parallel and Distributed Systems
The Opie compiler from row-major source to Morton-ordered matrices
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
SFCGen: A framework for efficient generation of multi-dimensional space-filling curves by recursion
ACM Transactions on Mathematical Software (TOMS)
Symbolic bounds analysis of pointers, array indices, and accessed memory regions
ACM Transactions on Programming Languages and Systems (TOPLAS)
A fully portable high performance minimal storage hybrid format Cholesky algorithm
ACM Transactions on Mathematical Software (TOMS)
Statistical Models for Empirical Search-Based Performance Tuning
International Journal of High Performance Computing Applications
A hierarchical model of data locality
Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms
Proceedings of the 2006 workshop on Memory system performance and correctness
Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Representation-transparent matrix algorithms with scalable performance
Proceedings of the 21st annual international conference on Supercomputing
Adaptive Strassen's matrix multiplication
Proceedings of the 21st annual international conference on Supercomputing
Adaptive Winograd's matrix multiplications
ACM Transactions on Mathematical Software (TOMS)
Using non-canonical array layouts in dense matrix operations
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Evaluating ISA support and hardware support for recursive data layouts
HiPC'07 Proceedings of the 14th international conference on High performance computing
Using recursion to boost ATLAS's performance
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
New data structures for matrices and specialized inner kernels: low overhead for high performance
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Cache-oblivious polygon indecomposability testing
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Oracle scheduling: controlling granularity in implicitly parallel languages
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
ACM Transactions on Algorithms (TALG)
Optimizing matrix multiplication with a classifier learning system
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
A data locality methodology for matrix---matrix multiplication algorithm
The Journal of Supercomputing
A study on load imbalance in parallel hypermatrix multiplication using OpenMP
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
A cache oblivious algorithm for matrix multiplication based on peano's space filling curve
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Adapting linear algebra codes to the memory hierarchy using a hypermatrix scheme
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
JuliusC: a practical approach for the analysis of divide-and-conquer algorithms
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
IFL'05 Proceedings of the 17th international conference on Implementation and Application of Functional Languages
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Hi-index | 0.00 |
An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimrd code, tracking hand-coded BLAS3 routines. Proof of concept is demonstrated by racing the in-place algorithm against manufacturer's hand-tuned BLAS3 routines; it can win.The recursive code bifurcates naturally at the top level into independent block-oriented processes, that each writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers that merrily unroll loops for the super-scalar R8000 processor, but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter future programmers from using this rich class of recursive algorithms.