POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Strategies for cache and local memory management by global program transformation
Proceedings of the 1st International Conference on Supercomputing
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Structure of Computers and Computations
Structure of Computers and Computations
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Blocking Linear Algebra Codes for Memory Hierarchies
Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing
Loop Quantization: an Analysis and Algorithm
Loop Quantization: an Analysis and Algorithm
Optimizing supercompilers for supercomputers
Optimizing supercompilers for supercomputers
Software methods for improvement of cache performance on supercomputer applications
Software methods for improvement of cache performance on supercomputer applications
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Optimization of array accesses by collective loop transformations
ICS '91 Proceedings of the 5th international conference on Supercomputing
Analysis and transformation in the ParaScope editor
ICS '91 Proceedings of the 5th international conference on Supercomputing
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Register allocation via hierarchical graph coloring
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Tiling multidimensional iteration spaces for nonshared memory machines
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Unexpected side effects of inline substitution: a case study
ACM Letters on Programming Languages and Systems (LOPLAS)
Automatic partitioning of a program dependence graph into parallel tasks
IBM Journal of Research and Development
Register allocation for software pipelined loops
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Optimizing for parallelism and data locality
ICS '92 Proceedings of the 6th international conference on Supercomputing
Compiler blockability of numerical algorithms
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
A practical data flow framework for array reference analysis and its use in optimizations
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
A novel framework of register allocation for software pipelining
POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Compile-time support for efficient data race detection in shared-memory parallel programs
PADD '93 Proceedings of the 1993 ACM/ONR workshop on Parallel and distributed debugging
Memory access coalescing: a technique for eliminating redundant memory accesses
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Compiler optimizations for improving data locality
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Compiler transformations for high-performance computing
ACM Computing Surveys (CSUR)
Skewed associativity enhances performance predictability
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A limit study of local memory requirements using value reuse profiles
Proceedings of the 28th annual international symposium on Microarchitecture
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
A quantitative analysis of loop nest locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The intrinsic bandwidth requirements of ordinary programs
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Data prefetching and multilevel blocking for linear algebra operations
ICS '96 Proceedings of the 10th international conference on Supercomputing
Block algorithms for sparse matrix computations on high performance workstations
ICS '96 Proceedings of the 10th international conference on Supercomputing
Register promotion in C programs
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A victim cache for vector registers
ICS '97 Proceedings of the 11th international conference on Supercomputing
Unroll-and-jam using uniformly generated sets
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Putting pointer analysis to work
POPL '98 Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Path-sensitive value-flow analysis
POPL '98 Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Automatic selection of high-order transformations in the IBM XL FORTRAN compilers
IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
A general algorithm for tiling the register level
ICS '98 Proceedings of the 12th international conference on Supercomputing
An Efficient Solution to the Cache Thrashing Problem Caused by True Data Sharing
IEEE Transactions on Computers
Quantitative Evaluation of Register Pressure on Software Pipelined Loops
International Journal of Parallel Programming
Load-reuse analysis: design and evaluation
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Improving memory hierarchy performance for irregular applications
ICS '99 Proceedings of the 13th international conference on Supercomputing
Quantifying loop nest locality using SPEC'95 and the perfect benchmarks
ACM Transactions on Computer Systems (TOCS)
Proceedings of the 14th international conference on Supercomputing
Optimized unrolling of nested loops
Proceedings of the 14th international conference on Supercomputing
Unroll-based register coalescing
Proceedings of the 14th international conference on Supercomputing
From flop to megaflops: Java for technical computing
ACM Transactions on Programming Languages and Systems (TOPLAS)
Improving Memory Traffic by Assembly-Level Exploitation of Reuses for Vector Registers
The Journal of Supercomputing
Data locality enhancement by memory reduction
ICS '01 Proceedings of the 15th international conference on Supercomputing
Eliminating redundancies in sum-of-product array computations
ICS '01 Proceedings of the 15th international conference on Supercomputing
Computer aided hand tuning (CAHT): “applying case-based reasoning to performance tuning”
ICS '01 Proceedings of the 15th international conference on Supercomputing
Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Loop Transformations for Architectures with Partitioned Register Banks
OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
C Compiler Design for an Industrial Network Processor
OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
Efficient Representation Scheme for Multidimensional Array Operations
IEEE Transactions on Computers
Tera hardware-software cooperation
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
A compiler approach to fast hardware design space exploration in FPGA-based systems
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Space-time trade-off optimization for a class of electronic structure calculations
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Experiences tuning SMG98: a semicoarsening multigrid benchmark based on the hypre library
ICS '02 Proceedings of the 16th international conference on Supercomputing
Global array reference allocation
ACM Transactions on Design Automation of Electronic Systems (TODAES)
International Journal of Parallel Programming
Optimized Unrolling of Nested Loops
International Journal of Parallel Programming
Register tiling in nonrectangular iteration spaces
ACM Transactions on Programming Languages and Systems (TOPLAS)
Increasing temporal locality with skewing and recursive blocking
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
International Journal of Parallel Programming
An Iteration Partition Approach for Cache or Local Memory Thrashing on Parallel Processing
IEEE Transactions on Computers
Skewed Associativity Improves Program Performance and Enhances Predictability
IEEE Transactions on Computers
Interactive Parallel Programming using the ParaScope Editor
IEEE Transactions on Parallel and Distributed Systems
The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Efficient Pipelining of Nested Loops: Unroll-and-Squash
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
From Flop to MegaFlops: Java for Technical Computing
LCPC '98 Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing
Iteration Space Slicing for Locality
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
A Blocked All-Pairs Shortest-Path Algorithm
SWAT '00 Proceedings of the 7th Scandinavian Workshop on Algorithm Theory
Efficient Sorting Using Registers and Caches
WAE '00 Proceedings of the 4th International Workshop on Algorithm Engineering
Address Code and Arithmetic Optimizations for Embedded Systems
ASP-DAC '02 Proceedings of the 2002 Asia and South Pacific Design Automation Conference
IEEE Transactions on Parallel and Distributed Systems
Efficient sorting using registers and caches
Journal of Experimental Algorithmics (JEA)
Vectorizing for a SIMdD DSP architecture
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Register allocation for optimal loop scheduling
CASCON '93 Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research: distributed computing - Volume 2
An experimental evaluation of scalar replacement on scientific benchmarks
Software—Practice & Experience
ACM SIGPLAN Notices - Best of PLDI 1979-1999
A data locality optimizing algorithm
ACM SIGPLAN Notices - Best of PLDI 1979-1999
A blocked all-pairs shortest-paths algorithm
Journal of Experimental Algorithmics (JEA)
Applications of storage mapping optimization to register promotion
Proceedings of the 18th annual international conference on Supercomputing
An Integrated Approach for Improving Cache Behavior
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Optimizing Address Code Generation for Array-Intensive DSP Applications
Proceedings of the international symposium on Code generation and optimization
Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Dynamic loop pipelining in data-driven architectures
Proceedings of the 2nd conference on Computing frontiers
A case for a working-set-based memory hierarchy
Proceedings of the 2nd conference on Computing frontiers
Optimizing Sparse Matrix-Vector Product Computations Using Unroll and Jam
International Journal of High Performance Computing Applications
Reaching fast code faster: using modeling for efficient software thread integration on a VLIW DSP
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Improving power efficiency with compiler-assisted cache replacement
Journal of Embedded Computing - Cache exploitation in embedded systems
An experimental comparison of cache-oblivious and cache-conscious programs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Exploiting virtual registers to reduce pressure on real registers
ACM Transactions on Architecture and Code Optimization (TACO)
Compiling for an indirect vector register architecture
Proceedings of the 5th conference on Computing frontiers
Program optimization carving for GPU computing
Journal of Parallel and Distributed Computing
Positivity, posynomials and tile size selection
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Redundancy elimination revisited
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Exploiting loop-dependent stream reuse for stream processors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Convergent Compilation Applied to Loop Unrolling
Transactions on High-Performance Embedded Architectures and Compilers I
Mapping the LU decomposition on a many-core architecture: challenges and solutions
Proceedings of the 6th ACM conference on Computing frontiers
Compact multi-dimensional kernel extraction for register tiling
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Exploiting the reuse supplied by loop-dependent stream references for stream processors
ACM Transactions on Architecture and Code Optimization (TACO)
Optimizing and auto-tuning belief propagation on the GPU
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Automatic parallelization via matrix multiplication
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Combined ILP and register tiling: analytical model and optimization framework
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Analytic models and empirical search: a hybrid approach to code optimization
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
A methodology for procedure cloning
Computer Languages
DeadSpy: a tool to pinpoint program inefficiencies
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Memory Latency Hiding by Load Value Speculation for Reconfigurable Computers
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Improved loop tiling based on the removal of spurious false dependences
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Interprocedural strength reduction of critical sections in explicitly-parallel programs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Hi-index | 0.01 |
Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables.In this paper, we present a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The objective is to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers. In addition, we present transformations to improve the overall effectiveness of scalar replacement and show how these transformations can be applied in a variety of loop nest types. Finally, we present experimental results showing that these techniques are extremely effective—capable of achieving integer factor speedups over code generated by good optimizing compilers of conventional design.