Quadtree matrices in Morton-order storage provide natural blocking at every level of a memory hierarchy. Writing the natural recursive algorithms to take advantage of this blocking results in code that honors the memory hierarchy without any further code transformation. Furthermore, the divide-and-conquer algorithms break problems down into independent computations, which can be dispatched in parallel for straightforward parallel processing.

Proof of concept is given by an algorithm for QR factorization based on Givens rotations for quadtree matrices in Morton-order storage. The algorithms deliver positive results, competing with and even beating the LAPACK equivalent.
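
To make the blocking claim concrete, the following is a minimal sketch of Morton-order indexing, not the authors' implementation. It assumes a square matrix whose side is a power of two and a quadrant order of northwest, northeast, southwest, southeast; the function names `morton_index` and `quadrant_range` are hypothetical and chosen only for illustration. The index is formed by interleaving the bits of the row and column numbers, so each quadrant at every level of the quadtree occupies one contiguous range of storage.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of row and col into one Morton (Z-order) index.

    Assumed convention: column bits land in even positions and row bits
    in odd positions, giving quadrant order NW, NE, SW, SE.
    """
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index


def quadrant_range(n, quadrant):
    """Return (start, stop) of one quadrant of an n x n Morton-order matrix.

    n must be a power of two; quadrant is 0 (NW), 1 (NE), 2 (SW) or 3 (SE).
    Every quadrant is a contiguous slice of the underlying storage.
    """
    size = (n * n) // 4
    return quadrant * size, (quadrant + 1) * size


if __name__ == "__main__":
    n = 4
    # Print where each (row, col) element lives in linear storage.
    for r in range(n):
        print([morton_index(r, c) for c in range(n)])
    # The northeast quadrant (rows 0-1, columns 2-3) is one contiguous run.
    print(quadrant_range(n, 1))
```

Because a quadrant is a contiguous slice, a recursive algorithm can pass sub-blocks down the recursion as offset and length pairs without copying, which is what lets the same code block naturally at each level of the memory hierarchy.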