Improving the ratio of memory operations to floating-point operations in loops

Authors:
Steve Carr;Ken Kennedy
Affiliations:
Michigan Technological Univ., Houghton;Rice Univ., Houston, TX
Venue:
ACM Transactions on Programming Languages and Systems (TOPLAS)
Year:
1994

Citing 20
Cited 58

Automatic translation of FORTRAN programs to vector form

ACM Transactions on Programming Languages and Systems (TOPLAS)
Loop skewing: the wavefront method revisited

International Journal of Parallel Programming
Estimating interlock and improving balance for pipelined architectures

Journal of Parallel and Distributed Computing
Overlapped loop support in the Cydra 5

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Strategies for cache and local memory management by global program transformation

Proceedings of the 1st International Conference on Supercomputing
Improving register allocation for subscripted variables

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Register allocation for software pipelined loops

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Optimizing for parallelism and data locality

ICS '92 Proceedings of the 6th international conference on Supercomputing
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
A practical data flow framework for array reference analysis and its use in optimizations

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Register allocation via graph coloring

Register allocation via graph coloring
Scalar replacement in the presence of conditional control flow

Software—Practice & Experience
Memory-hierarchy management

Memory-hierarchy management
The Generation of Optimal Code for Arithmetic Expressions

Journal of the ACM (JACM)
Register allocation by priority-based coloring

SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Structure of Computers and Computations

Structure of Computers and Computations
Data Flow and Dependence Analysis for Instruction Level Parallelism

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Loop Quantization: an Analysis and Algorithm

Loop Quantization: an Analysis and Algorithm

Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
A quantitative analysis of loop nest locality

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Combining loop transformations considering caches and scheduling

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Unroll-and-jam using uniformly generated sets

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Load-reuse analysis: design and evaluation

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Code transformations to improve memory parallelism

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Quantifying loop nest locality using SPEC'95 and the perfect benchmarks

ACM Transactions on Computer Systems (TOCS)
Optimized unrolling of nested loops

Proceedings of the 14th international conference on Supercomputing
Transforming loops to recursion for multi-level memory hierarchies

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Loop Transformations for Architectures with Partitioned Register Banks

OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
A compiler approach to fast hardware design space exploration in FPGA-based systems

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Loop fusion for clustered VLIW architectures

Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Optimized Unrolling of Nested Loops

International Journal of Parallel Programming
Register tiling in nonrectangular iteration spaces

ACM Transactions on Programming Languages and Systems (TOPLAS)
Increasing temporal locality with skewing and recursive blocking

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Combining Loop Transformations Considering Caches and Scheduling

International Journal of Parallel Programming
Quantifying the Multi-Level Nature of Tiling Interactions

International Journal of Parallel Programming
A Layout-Conscious Iteration Space Transformation Technique

IEEE Transactions on Computers
Compiler-Directed Dynamic Frequency and Voltage Scheduling

PACS '00 Proceedings of the First International Workshop on Power-Aware Computer Systems-Revised Papers
Optimizing Loop Performance for Clustered VLIW Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Using estimates from behavioral synthesis tools in compiler-directed design space exploration

Proceedings of the 40th annual Design Automation Conference
Predicting whole-program locality through reuse distance analysis

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Transforming Complex Loop Nests for Locality

The Journal of Supercomputing
Single-Dimension Software Pipelining for Multi-Dimensional Loops

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Software pipelining: an effective scheduling technique for VLIW machines

ACM SIGPLAN Notices - Best of PLDI 1979-1999
Improving register allocation for subscripted variables

ACM SIGPLAN Notices - Best of PLDI 1979-1999
Applications of storage mapping optimization to register promotion

Proceedings of the 18th annual international conference on Supercomputing
An innovative low-power high-performance programmable signal processor for digital communications

IBM Journal of Research and Development
The Energy Impact of Aggressive Loop Fusion

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy

Proceedings of the international symposium on Code generation and optimization
Predicting Unroll Factors Using Supervised Classification

Proceedings of the international symposium on Code generation and optimization
Automatic blocking of QR and LU factorizations for locality

MSP '04 Proceedings of the 2004 workshop on Memory system performance
Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion

International Journal of High Performance Computing Applications
Optimizing Sparse Matrix-Vector Product Computations Using Unroll and Jam

International Journal of High Performance Computing Applications
Automatic tuning of whole applications using direct search and a performance-based transformation system

The Journal of Supercomputing
Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms

Proceedings of the 2006 workshop on Memory system performance and correctness
Single-dimension software pipelining for multidimensional loops

ACM Transactions on Architecture and Code Optimization (TACO)
The impact of loop unrolling on controller delay in high level synthesis

Proceedings of the conference on Design, automation and test in Europe
An operation stacking framework for large ensemble computations

Proceedings of the 21st annual international conference on Supercomputing
Analyzing memory access intensity in parallel programs on multicore

Proceedings of the 22nd annual international conference on Supercomputing
Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputers

CHINA HPC '07 Proceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing
Automatic analysis for managing and optimizing performance-code quality

Proceedings of the 2008 workshop on Static analysis
Roofline: an insightful visual performance model for multicore architectures

Communications of the ACM - A Direct Path to Dependable Software
Program locality analysis using reuse distance

ACM Transactions on Programming Languages and Systems (TOPLAS)
Overview of Multicore Requirements towards Real-Time Communication

SEUS '09 Proceedings of the 7th IFIP WG 10.2 International Workshop on Software Technologies for Embedded and Ubiquitous Systems
Instruction balance and its relation to program energy consumption

LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
A programming language interface to describe transformations and code generation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
World-highest resolution global atmospheric model and its performance on the Earth Simulator

State of the Practice Reports
Combined ILP and register tiling: analytical model and optimization framework

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Applying loop optimizations to object-oriented abstractions through general classification of array semantics

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Extending the applicability of scalar replacement to multiple induction variables

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Loop transformation recipes for code generation and auto-tuning

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Automated programmable control and parameterization of compiler optimizations

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
POET: a scripting language for applying parameterized source-to-source program transformations

Software—Practice & Experience
Automatic restructuring of GPU kernels for exploiting inter-thread data locality

CC'12 Proceedings of the 21st international conference on Compiler Construction
Loop acceleration exploration for ASIP architecture

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

International Journal of High Performance Computing Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

Over the past decade, microprocessor design strategies have focused on increasing the computational power on a single chip. Because computations often require more data from cache per floating-point operation than a machine can deliver and because operations are pipelined, idle computational cycles are common when scientific applications are executed. To overcome these bottlenecks, programmers have learned to use a coding style that ensures a better balance between memory references and floating-point operations. In our view, this is a step in the wrong direction because it makes programs more machine-specific. A programmer should not be required to write a new program version for each new machine; instead, the task of specializing a program to a target machine should be left to the compiler.But is our view practical? Can a sophisticated optimizing compiler obviate the need for the myriad of programming tricks that have found their way into practice to improve the performance of the memory hierarchy? In this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock. To do this, they estimate statically the balance between memory operations and floating-point operations for each loop in a particular program and use these estimates to determine whether to apply various loop transformations.Experiments with our automatic techniques show that integer-factor speedups are possible on kernels. Additionally, the estimate of the balance between memory operations and computation, and the application of the estimate are very accurate—experiments reveal little difference between the balance achieved by our automatic system that is made possible by hand optimization.