Combining analytical and empirical approaches in tuning matrix transposition

Authors:
Qingda Lu; Sriram Krishnamoorthy; P. Sadayappan
Affiliations:
The Ohio State University, Columbus, OH; The Ohio State University, Columbus, OH; The Ohio State University, Columbus, OH
Venue:
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Year:
2006

Abstract

Matrix transposition is an important kernel used in many applications. Although its optimization has been the subject of many studies, no optimization procedure has been developed that targets the characteristics of current processor architectures. In this paper, we develop an integrated optimization framework that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, use of memory subsystem characteristics, and exploitation of the parallelism provided by the vector instruction sets in current processors. A judicious combination of analytical and empirical approaches is used to determine the most appropriate optimizations. Because problem information is unavailable until execution time, multiple versions of the code are generated, and the best version is chosen at runtime with assistance from minimal-overhead inspectors. The approach highlights aspects of empirical optimization that are important for similar computations with little temporal reuse. Experimental results on PowerPC G5 and Intel Pentium 4 demonstrate the effectiveness of the developed framework.
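
Two ideas named in the abstract, tiling for the memory hierarchy and choosing among several code versions at runtime, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tile size of 32, and the size-based selection rule are all hypothetical placeholders, and the paper's handling of misalignment and vector instructions is not shown. Python is used only to make the loop structure readable; it does not reproduce the cache behavior that motivates tiling in compiled code. The sketch assumes a non-empty rectangular matrix.

```python
from typing import Callable, List

Matrix = List[List[float]]


def transpose_naive(a: Matrix) -> Matrix:
    """Untiled out-of-place transpose, used here as the baseline version."""
    rows, cols = len(a), len(a[0])
    out = [[0.0] * rows for _ in range(cols)]
    for i in range(rows):
        for j in range(cols):
            out[j][i] = a[i][j]
    return out


def transpose_tiled(a: Matrix, tile: int = 32) -> Matrix:
    """Out-of-place transpose that visits the matrix one tile at a time.

    In compiled code, working on a tile-by-tile block keeps both the rows
    being read and the columns being written within a small working set.
    The tile size is a placeholder, not a tuned value.
    """
    rows, cols = len(a), len(a[0])
    out = [[0.0] * rows for _ in range(cols)]
    for ii in range(0, rows, tile):          # tile origin, row direction
        for jj in range(0, cols, tile):      # tile origin, column direction
            # min() clips partial tiles at the matrix edges.
            for i in range(ii, min(ii + tile, rows)):
                for j in range(jj, min(jj + tile, cols)):
                    out[j][i] = a[i][j]
    return out


def select_version(rows: int, cols: int, tile: int = 32) -> Callable[[Matrix], Matrix]:
    """Hypothetical inspector: pick a version from sizes known only at runtime.

    The rule (skip tiling when the whole matrix fits in one tile) is an
    assumption for illustration, not a criterion taken from the paper.
    """
    if rows <= tile and cols <= tile:
        return transpose_naive
    return lambda a: transpose_tiled(a, tile)
```

A caller would inspect the problem size once and then reuse the chosen version, for example `transpose = select_version(len(a), len(a[0]))` followed by `b = transpose(a)`. In a tuned library the candidate versions and the selection rule would come from the kind of analytical modeling and empirical search the abstract describes.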