Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy
Proceedings of the international symposium on Code generation and optimization
Adaptive Strassen's matrix multiplication
Proceedings of the 21st annual international conference on Supercomputing
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
High-performance implementation of the level-3 BLAS
ACM Transactions on Mathematical Software (TOMS)
Exploring the Optimization Space of Dense Linear Algebra Kernels
Languages and Compilers for Parallel Computing
Engineering A Compiler
Automating the generation of composed linear algebra kernels
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Automatic Library Generation for BLAS3 on GPUs
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Automated programmable control and parameterization of compiler optimizations
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Extendable pattern-oriented optimization directives
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
POET: a scripting language for applying parameterized source-to-source program transformations
Software—Practice & Experience
Graph expansion and communication costs of fast matrix multiplication
Journal of the ACM (JACM)
Layout-oblivious compiler optimization for matrix computations
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor
ICPADS '12 Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems
Hi-index | 0.00 |
Basic Liner algebra subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on varying multi-core CPUs without requiring any manual interference from developers. In particular, based on domain-specific knowledge about algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and thereby translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations of Intel MKL and AMD ACML BLAS libraries, on both Intel Sandy Bridge and AMD Piledriver processors.