As computer architectures change, placing more caches on multicore chips, locality matters even more. With the bandwidth between caches and RAM now even more valuable, the additional locality offered by new matrix representations will be important for keeping multiple processors busy. The default storage representations of C and Fortran, row-major and column-major respectively, are fundamentally ill-suited to many matrix computations. Switching the storage representation from Cartesian to block indices makes better use of locality at every level of the memory hierarchy, from L1 cache to paging. This paper changes only the storage representation, from row-major to Morton-hybrid, and applies it to matrix multiplication. Its purpose is to show that, even with only traditional iterative algorithms, simply changing the storage representation offers significant speedups.
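To make the change from Cartesian to block indices concrete, the following is a minimal sketch of one possible index mapping. It assumes that "Morton-hybrid" means small row-major blocks arranged in Morton (Z) order, which the abstract itself does not spell out. The function names, the `block` parameter, and the 16-bit coordinate limit are hypothetical choices made for illustration, not taken from the paper.

```python
def interleave_bits(row: int, col: int, bits: int = 16) -> int:
    """Morton (Z-order) index of (row, col).

    Column bits go to even bit positions and row bits to odd positions,
    so the four quadrants of any aligned 2x2 group are numbered 0, 1, 2, 3
    in the order NW, NE, SW, SE. Coordinates must fit in `bits` bits.
    """
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)
        z |= ((row >> b) & 1) << (2 * b + 1)
    return z


def row_major_offset(i: int, j: int, n: int) -> int:
    """Offset of element (i, j) in an n-column row-major array."""
    return i * n + j


def morton_hybrid_offset(i: int, j: int, block: int = 4) -> int:
    """Offset of element (i, j) in the assumed Morton-hybrid layout.

    The matrix is cut into block x block tiles. Tiles are ordered by the
    Morton index of their (tile_row, tile_col) coordinates; elements
    inside a tile are stored row-major.
    """
    tile_row, in_row = divmod(i, block)
    tile_col, in_col = divmod(j, block)
    tile_index = interleave_bits(tile_row, tile_col)
    return tile_index * block * block + in_row * block + in_col


def to_morton_hybrid(flat, n: int, block: int = 4):
    """Copy an n x n row-major list into the assumed Morton-hybrid order.

    Assumes n is a multiple of `block` and n // block is a power of two,
    so the mapping fills the output with no gaps.
    """
    out = [None] * (n * n)
    for i in range(n):
        for j in range(n):
            out[morton_hybrid_offset(i, j, block)] = flat[row_major_offset(i, j, n)]
    return out
```

Under this mapping every tile occupies one contiguous run of `block * block` elements, whereas in row-major storage the rows of a tile are separated by a stride of `n`. That contiguity is the kind of locality the abstract refers to. An iterative multiplication could keep its usual loop structure and change only the address computation.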