SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A Method for Transposing a Matrix
Journal of the ACM (JACM)
Algorithm 467: Matrix Transposition in Place
Communications of the ACM
Algorithm 380: in-situ transposition of a rectangular matrix [F1]
Communications of the ACM
IEEE Transactions on Parallel and Distributed Systems
Static and Dynamic Locality Optimizations Using Integer Linear Programming
IEEE Transactions on Parallel and Distributed Systems
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Fast Bit-Reversals on Uniprocessors and Shared-Memory Multiprocessors
SIAM Journal on Scientific Computing
A Layout-Conscious Iteration Space Transformation Technique
IEEE Transactions on Computers
Towards an Optimal Bit-Reversal Permutation Program
FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science
Memory Hierarchy Considerations for Fast Transpose and Bit-Reversals
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Vectorization for SIMD architectures with alignment constraints
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy
Proceedings of the international symposium on Code generation and optimization
Automatic measurement of memory hierarchy parameters
SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Generation of permutations for SIMD processors
LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Tuning High Performance Kernels through Empirical Compilation
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Think globally, search locally
Proceedings of the 19th annual international conference on Supercomputing
Optimizing data permutations for SIMD devices
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Generating SIMD vectorized permutations
CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
Journal of Parallel and Distributed Computing
Loop transformation recipes for code generation and auto-tuning
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Optimizing tensor contraction expressions for hybrid CPU-GPU execution
Cluster Computing
Hi-index | 0.00 |
Matrix transposition is an important kernel used in many applications. Even though its optimization has been the subject of many studies, an optimization procedure that targets the characteristics of current processor architectures has not been developed. In this paper, we develop an integrated optimization framework that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors. A judicious combination of analytical and empirical approaches is used to determine the most appropriate optimizations. The absence of problem information until execution time is handled by generating multiple versions of the code - the best version is chosen at runtime, with assistance from minimal-overhead inspectors. The approach highlights aspects of empirical optimization that are important for similar computations with little temporal reuse. Experimental results on PowerPC G5 and Intel Pentium 4 demonstrate the effectiveness of the developed framework.