The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Language support for Morton-order matrices
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Tuning Strassen's matrix multiplication for memory efficiency
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Recursive Array Layouts and Fast Matrix Multiplication
IEEE Transactions on Parallel and Distributed Systems
Is Morton Layout Competitive for Large Two-Dimensional Arrays?
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
The Opie compiler from row-major source to Morton-ordered matrices
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Recursive data layouts for matrices (two-dimensional arrays) have been proposed to ameliorate the poor data locality of traditional layouts such as row-major and column-major [3][12]. However, recursive data layouts require non-traditional address computation involving bit-level manipulations that current processors do not support. As a result, a number of software-based address computation techniques have been developed, ranging from table lookups to sequences of arithmetic and logic operations. These techniques trade extra computation for improved locality. In this paper, we design instruction set architecture (ISA) support and hardware support for address computation in recursive data layouts. Our technique captures the locality benefits of these layouts while avoiding the cost of software-based address computation. Simulations show that our hardware approach improves the performance of matrix multiplication by 30% to 59% over software-computed Morton-ordered indexing, with the improvement most pronounced at larger matrix sizes.
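
To make the software cost concrete, the following is a minimal sketch of the two families of software address computation mentioned above: an arithmetic-and-logic version and a table-lookup version. It is illustrative only and is not taken from the paper. The function names, the 16-bit limit on each index, and the convention that column bits occupy the even bit positions are all assumptions.

```python
def spread_bits(v: int) -> int:
    """Insert a zero bit between each of the low 16 bits of v (assumed width)."""
    v &= 0x0000FFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def morton_index(row: int, col: int) -> int:
    """Arithmetic-and-logic variant: interleave the bits of row and col.

    Assumed convention: column bits land in even positions, row bits in odd.
    """
    return (spread_bits(row) << 1) | spread_bits(col)


# Table-lookup variant: precompute the spread form of every byte value once.
SPREAD_TABLE = [spread_bits(b) for b in range(256)]


def spread_bits_lut(v: int) -> int:
    """Spread a 16-bit value using two table lookups instead of shift/mask steps."""
    return SPREAD_TABLE[v & 0xFF] | (SPREAD_TABLE[(v >> 8) & 0xFF] << 16)


def morton_index_lut(row: int, col: int) -> int:
    """Table-lookup variant of morton_index, same bit convention."""
    return (spread_bits_lut(row) << 1) | spread_bits_lut(col)


def row_major_index(row: int, col: int, n: int) -> int:
    """Conventional row-major offset in an n-by-n matrix, for comparison."""
    return row * n + col
```

Under these assumptions, a row-major access costs one multiply and one add, while each Morton access costs several shift-and-mask steps or several memory lookups. That per-access overhead is the kind of cost that hardware support for address computation aims to remove.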