The performance of modern microprocessors is greatly affected by cache behavior, instruction scheduling, register allocation, and loop overhead. High-level loop transformations such as fission, fusion, tiling, interchange, and outer-loop unrolling (e.g., unroll-and-jam) are well known to be capable of improving all of these aspects of performance. Difficulties arise because these machine characteristics and these optimizations are highly interdependent. Interchanging two loops might, for example, improve cache behavior but make it impossible to allocate registers in the inner loop. Similarly, unrolling or interchanging a loop might individually hurt performance, yet doing both simultaneously might help. Little work has been published on how to combine these transformations into an efficient and effective compiler algorithm. In this paper we present a model that estimates total machine cycle time, taking into account cache misses, software pipelining, register pressure, and loop overhead. We then develop an algorithm that intelligently searches through the possible transformations, using our machine model to select the set of transformations leading to the best overall performance. We have implemented this algorithm as part of the MIPSPro commercial compiler system. We give experimental results showing that our approach is both effective and efficient in optimizing numerical programs.
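To make the interplay between outer-loop unrolling and register pressure concrete, here is a minimal sketch of unroll-and-jam applied to a dense matrix multiply. It is an illustration only, not the paper's algorithm or the compiler's output: the function names, the unroll factor of 2, and the choice of matrix multiply as the kernel are all assumptions made for this example.

```python
def matmul_baseline(a, b, n):
    """Plain i-j-k matrix multiply on n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unroll_and_jam(a, b, n):
    """Hypothetical example: unroll the outer i loop by 2 and jam
    (fuse) the two copies of the loop body into a single inner k loop."""
    c = [[0.0] * n for _ in range(n)]
    i = 0
    while i + 1 < n:
        for j in range(n):
            s0 = 0.0  # accumulator for row i
            s1 = 0.0  # accumulator for row i + 1
            for k in range(n):
                bkj = b[k][j]  # loaded once, reused by both jammed bodies
                s0 += a[i][k] * bkj
                s1 += a[i + 1][k] * bkj
            c[i][j] = s0
            c[i + 1][j] = s1
        i += 2
    if i < n:  # remainder row when n is odd
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


if __name__ == "__main__":
    n = 5
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i * j + 1) for j in range(n)] for i in range(n)]
    print(matmul_unroll_and_jam(a, b, n) == matmul_baseline(a, b, n))
```

In the jammed version each element of b is read once for two rows of the result, which reduces memory operations per floating-point operation, but the inner loop now keeps two accumulators live instead of one. Larger unroll factors push this trade-off further, which is the kind of interdependence a combined cost model has to weigh.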