Layout-oblivious compiler optimization for matrix computations

Authors:
Huimin Cui;Qing Yi;Jingling Xue;Xiaobing Feng
Affiliations:
CAS;University of Colorado at Colorado Springs;University of New South Wales;CAS
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Year:
2013

Abstract

Most scientific computations apply mathematical operations to a set of preconceived data structures, e.g., matrices, vectors, and grids. In this article, we use a number of widely used matrix computations from the LINPACK library to demonstrate that complex internal organizations of data structures can severely degrade the effectiveness of compiler optimizations. We then present a data-layout-oblivious optimization methodology: by isolating an abstract representation of the computations from the complex implementation details of their data, we enable these computations to be analyzed and optimized much more accurately by a variety of state-of-the-art compiler technologies. We evaluated our approach on an Intel 8-core platform using two source-to-source compiler infrastructures, Pluto and EPOD. Our results show that while the efficiency of a computational kernel differs across data layouts, the alternative implementations typically benefit from a common set of optimizations on the operations. Therefore, separately optimizing the operations and the data layout of a computation could dramatically enhance the effectiveness of compiler optimizations compared with conventional approaches that use a unified representation.
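
The separation described in the abstract can be illustrated with a small sketch. It is not taken from the article: the function names (`full_index`, `packed_lower_index`, `matvec`, `tiled_matvec`), the choice of a matrix-vector product, and the two layouts are all assumptions made for illustration. The computation is written once against an abstract element access A(i, j), a loop transformation (tiling) is applied to that abstract form, and the storage layout is supplied separately as an index function.

```python
# Illustrative sketch only: all names here are hypothetical, not from the article.

def full_index(n):
    """Dense row-major layout: element (i, j) is stored at i*n + j."""
    return lambda i, j: i * n + j

def packed_lower_index(n):
    """Packed symmetric layout: only the lower triangle is stored, row by row."""
    def index(i, j):
        if j > i:  # symmetric matrix: mirror accesses above the diagonal
            i, j = j, i
        return i * (i + 1) // 2 + j
    return index

def matvec(n, data, index, x):
    """y = A*x written against abstract A(i, j); the layout enters only via index."""
    y = [0.0] * n
    for i in range(n):
        for j in range(n):
            y[i] += data[index(i, j)] * x[j]
    return y

def tiled_matvec(n, data, index, x, tile=2):
    """The same computation after tiling the j loop; index is left untouched."""
    y = [0.0] * n
    for jj in range(0, n, tile):
        for i in range(n):
            for j in range(jj, min(jj + tile, n)):
                y[i] += data[index(i, j)] * x[j]
    return y

# One small symmetric matrix stored in both layouts.
n = 3
dense = [4.0, 1.0, 2.0,
         1.0, 5.0, 3.0,
         2.0, 3.0, 6.0]
packed = [4.0,
          1.0, 5.0,
          2.0, 3.0, 6.0]
x = [1.0, 2.0, 3.0]

y_dense = matvec(n, dense, full_index(n), x)
y_packed = tiled_matvec(n, packed, packed_lower_index(n), x)
```

In this sketch the tiled loop nest is reused unchanged for the packed layout, which mirrors the abstract's observation that different layouts tend to benefit from a common set of optimizations on the operations. A real compiler would perform this separation statically rather than through a runtime index function.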