A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS).
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. ICS '97: Proceedings of the 11th International Conference on Supercomputing.
High-level semantic optimization of numerical codes. ICS '99: Proceedings of the 13th International Conference on Supercomputing.
High-performance parallel implicit CFD. Parallel Computing, special issue on parallel computing in aerospace.
Increasing temporal locality with skewing and recursive blocking. Proceedings of the 2001 ACM/IEEE Conference on Supercomputing.
Artificial Intelligence: A Modern Approach.
Iterative Methods for Sparse Linear Systems.
A practical automatic polyhedral parallelizer and locality optimizer. Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation.
Annotation-based empirical performance tuning using Orio. IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing.
Loop transformation recipes for code generation and auto-tuning. LCPC '09: Proceedings of the 22nd International Conference on Languages and Compilers for Parallel Computing.
The development of optimized codes is time-consuming and requires extensive architecture, compiler, and language expertise; computational scientists are therefore often forced to choose between investing considerable time in tuning code and accepting lower performance. In this paper, we describe the first steps toward a fully automated system for optimizing the matrix algebra kernels that are a foundational part of many scientific applications. To generate highly optimized code from a high-level MATLAB prototype, we define a three-step approach. First, we have developed a compiler that converts a MATLAB script into simple C code. Second, we use the polyhedral optimization system Pluto to optimize that code for coarse-grained parallelism and locality simultaneously. Finally, we annotate the resulting code with performance-tuning directives and use the empirical performance-tuning system Orio to generate many tuned versions of the same operation using different optimization techniques, such as loop unrolling and memory alignment. Orio then performs an automated empirical search to select the best of the optimized code variants. We discuss performance results on two architectures.
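To make the pipeline concrete, the sketch below shows the kind of simple C kernel the MATLAB-to-C step might emit for a matrix-matrix product, together with one hand-unrolled variant of the sort an empirical tuner such as Orio would generate and time. The function names, row-major layout, and unroll factor are illustrative assumptions, not the actual output of our compiler or Orio's annotation syntax.

#include <stddef.h>

/* Naive triple loop, of the kind a MATLAB-to-C translator might emit
   for C = A*B (square n x n matrices, row-major; names and layout are
   illustrative assumptions, not the compiler's actual output). */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

/* One point in the tuning search space: the reduction loop unrolled by 4.
   An autotuner generates many such variants (different unroll factors,
   tile sizes, alignment hints), times each, and keeps the fastest. */
void matmul_unroll4(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            size_t k = 0;
            for (; k + 4 <= n; k += 4) {          /* unrolled body */
                s += A[i*n + k]     * B[k*n + j];
                s += A[i*n + k + 1] * B[(k + 1)*n + j];
                s += A[i*n + k + 2] * B[(k + 2)*n + j];
                s += A[i*n + k + 3] * B[(k + 3)*n + j];
            }
            for (; k < n; k++)                    /* remainder loop */
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

The best unroll factor depends on register pressure, instruction latencies, and cache behavior, which is why an automated empirical search over variants like these tends to outperform any single fixed heuristic across architectures.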