Advanced compiler optimizations for supercomputers
Communications of the ACM - Special issue on parallelism
Automatic translation of FORTRAN programs to vector form
ACM Transactions on Programming Languages and Systems (TOPLAS)
Loop skewing: the wavefront method revisited
International Journal of Parallel Programming
Automatic decomposition of scientific programs for parallel execution
POPL '87 Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Estimating interlock and improving balance for pipelined architectures
Journal of Parallel and Distributed Computing
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Strategies for cache and local memory management by global program transformation
Proceedings of the 1st International Conference on Supercomputing
Coloring heuristics for register allocation
PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
IEEE Transactions on Computers
A practical data flow framework for array reference analysis and its use in optimizations
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Scalar replacement in the presence of conditional control flow
Software—Practice & Experience
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
ACM Transactions on Programming Languages and Systems (TOPLAS)
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Unroll-and-jam using uniformly generated sets
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Optimized unrolling of nested loops
Proceedings of the 14th international conference on Supercomputing
The parallel execution of DO loops
Communications of the ACM
Register allocation by priority-based coloring
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Dependence graphs and compiler optimizations
POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Structure of Computers and Computations
Structure of Computers and Computations
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Blocking Linear Algebra Codes for Memory Hierarchies
Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing
Optimizing Loop Performance for Clustered VLIW Architectures
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Register allocation & spilling via graph coloring
SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
Loop Quantization: an Analysis and Algorithm
Loop Quantization: an Analysis and Algorithm
Combining Optimization for Cache and Instruction-Level Parallelism
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Improving the performance of virtual memory computers.
Improving the performance of virtual memory computers.
Optimizing supercompilers for supercomputers
Optimizing supercompilers for supercomputers
Software methods for improvement of cache performance on supercomputer applications
Software methods for improvement of cache performance on supercomputer applications
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
On minimizing register usage of linearly scheduled algorithms with uniform dependencies
Computer Languages, Systems and Structures
Hi-index | 0.00 |
Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables.In this paper, we present a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The objective is to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers. In addition, we present transformations to improve the overall effectiveness of scalar replacement and show how these transformations can be applied in a variety of loop nest types. Finally, we present experimental results showing that these techniques are extremely effective---capable of achieving integer factor speedups over code generated by good optimizing compilers of conventional design.