Most scientific computations apply mathematical operations to a set of predefined data structures, e.g., matrices, vectors, and grids. In this article, we use a number of widely used matrix computations from the LINPACK library to demonstrate that complex internal organizations of data structures can severely degrade the effectiveness of compiler optimizations. We then present a data-layout-oblivious optimization methodology in which, by isolating an abstract representation of the computations from the complex implementation details of their data, we enable these computations to be analyzed and optimized much more accurately by various state-of-the-art compiler technologies. We evaluated our approach on an Intel 8-core platform using two source-to-source compiler infrastructures, Pluto and EPOD. Our results show that while the efficiency of a computational kernel differs across data layouts, the alternative implementations typically benefit from a common set of optimizations on the operations. Therefore, optimizing the operations and the data layout of a computation separately could dramatically enhance the effectiveness of compiler optimizations compared with conventional approaches that use a unified representation.
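The separation described above can be illustrated with a small sketch. It is not the article's implementation, and it involves neither Pluto nor EPOD. It only shows, under assumed names (`Accessor`, `full_accessor`, `packed_lower_accessor`, and `lower_tri_matvec` are all hypothetical), how a kernel written purely in terms of logical indices can run unchanged over two different data layouts.

```python
from typing import Callable, List

# Hypothetical accessor type: maps a logical (i, j) index to the stored value.
Accessor = Callable[[int, int], float]


def full_accessor(a: List[List[float]]) -> Accessor:
    """Layout 1: ordinary two-dimensional storage."""
    return lambda i, j: a[i][j]


def packed_lower_accessor(p: List[float]) -> Accessor:
    """Layout 2: lower triangle packed row by row into a one-dimensional list."""
    # Row i starts at offset i*(i+1)/2 and holds the entries j = 0..i.
    return lambda i, j: p[i * (i + 1) // 2 + j]


def lower_tri_matvec(n: int, at: Accessor, x: List[float]) -> List[float]:
    """Compute y = L x for lower-triangular L using only logical indices."""
    y = [0.0] * n
    for i in range(n):
        for j in range(i + 1):
            y[i] += at(i, j) * x[j]
    return y


# The same 3x3 lower-triangular matrix in both layouts.
full = [[1.0, 0.0, 0.0],
        [2.0, 3.0, 0.0],
        [4.0, 5.0, 6.0]]
packed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 1.0, 1.0]

y_full = lower_tri_matvec(3, full_accessor(full), x)
y_packed = lower_tri_matvec(3, packed_lower_accessor(packed), x)
print(y_full, y_packed)
```

In this sketch the loop nest in `lower_tri_matvec` never mentions the packed-index arithmetic, so an analysis of the loops sees the same simple `(i, j)` access pattern whichever layout is chosen; the layout-specific indexing lives entirely in the accessor.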