As computer architectures evolve, placing more caches on multicore chips, locality becomes even more important. With the bandwidth between caches and RAM increasingly precious, the additional locality offered by new matrix representations is needed to keep multiple processors busy. The default storage representations of C and FORTRAN, row-major and column-major respectively, have fundamental deficiencies for many matrix computations. By switching the storage representation from Cartesian to block indices, one can exploit cache locality at every level, from L1 to paging. This paper changes only the storage representation, from row-major to Morton-hybrid, and applies it to matrix multiplication. Its purpose is to show that, even with only traditional iterative algorithms, simply changing the storage representation offers significant speedups.