Improving register allocation for subscripted variables

Authors:
David Callahan; Steve Carr; Ken Kennedy
Affiliations:
Tera Computer Company, 400 N 34th St, Suite 300, Seattle, Washington; Department of Computer Science, Rice University, Houston, Texas; Department of Computer Science, Rice University, Houston, Texas
Venue:
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Year:
1990

Abstract

Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables.

In this paper, we present a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The objective is to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers. In addition, we present transformations to improve the overall effectiveness of scalar replacement and show how these transformations can be applied in a variety of loop nest types. Finally, we present experimental results showing that these techniques are extremely effective, capable of achieving integer-factor speedups over code generated by good optimizing compilers of conventional design.
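
As a rough illustration of the idea behind scalar replacement, the sketch below applies it by hand to a simple recurrence. It is not an example from the paper: the function names `prefix_sum_original` and `prefix_sum_replaced` are hypothetical, the choice of C is ours, and the `restrict` qualifiers encode the assumption that the two arrays do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: a[i - 1] is read back from memory on every iteration,
   even though the previous iteration has just stored that value. */
void prefix_sum_original(double *restrict a, const double *restrict b, size_t n)
{
    assert(n >= 1); /* the recurrence starts from a[0] */
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* Hand-applied scalar replacement (illustrative sketch): the value that
   flows from one iteration to the next is kept in the scalar t, so a
   typical register allocator can hold it in a floating-point register.
   Assumes a and b do not alias, as stated by restrict. */
void prefix_sum_replaced(double *restrict a, const double *restrict b, size_t n)
{
    assert(n >= 1);
    double t = a[0];            /* load the carried value once */
    for (size_t i = 1; i < n; i++) {
        t = t + b[i];           /* reuse t instead of reloading a[i - 1] */
        a[i] = t;               /* the store to a[i] is still required */
    }
}
```

Both versions perform the same floating-point operations in the same order, so they should produce identical results; the second simply removes one array load per iteration from the source.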