As microprocessor speeds increase, memory bandwidth is increasingly the performance bottleneck, because innovation and technological improvements in processor design have outpaced advances in memory design. Most attempts at addressing this problem have involved hardware solutions. Unfortunately, these solutions do little to help current microprocessors. In previous work, we developed, implemented, and evaluated an algorithm that exploited the ability of newer machines with wide buses to load or store multiple floating-point operands in a single memory reference. This paper describes a general code improvement algorithm that transforms code to better exploit the available memory bandwidth on existing microprocessors as well as wide-bus machines. Where possible and advantageous, the algorithm coalesces narrow memory references into wide ones. An interesting characteristic of the algorithm is that some decisions about the applicability of the transformation are made at run time. This dynamic analysis significantly increases the probability of the transformation being applied. The transformation was implemented and added to the repertoire of code improvements of an existing retargetable optimizing back end. Using three current architectures as evaluation platforms, the effectiveness of the transformation was measured on a set of compute- and memory-intensive programs. Interestingly, the effectiveness varied significantly with the instruction-set architecture of the tested platform. For one of the tested architectures, improvements in execution speed ranging from 5 to 40 percent were observed. For another, the improvements ranged from 5 to 20 percent, while for the third, the transformation resulted in slower code for all programs.
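To make the idea of coalescing narrow references concrete, the following is a minimal source-level sketch in C, not the algorithm from the paper, which operates on machine-level code inside a compiler back end. The function names `sum_bytes_narrow` and `sum_bytes_coalesced`, the choice of 8-bit elements, and the 32-bit wide access are all assumptions made for illustration. The sketch shows one plausible form of a run-time applicability decision: an alignment test picks between the coalesced loop and the original narrow loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one narrow (8-bit) memory reference per element. */
uint32_t sum_bytes_narrow(const uint8_t *p, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Hypothetical coalesced version: four 8-bit loads become one 32-bit load.
 * Whether this is safe and profitable depends on the address, which is only
 * known at run time, so the check is done here rather than at compile time. */
uint32_t sum_bytes_coalesced(const uint8_t *p, size_t n) {
    /* Run-time check (assumed criterion): fall back to the narrow loop
     * when the start address is not 4-byte aligned. */
    if (((uintptr_t)p & 3u) != 0)
        return sum_bytes_narrow(p, n);

    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, p + i, sizeof w); /* one wide load; memcpy avoids aliasing issues */
        /* Extract the four bytes with shifts and masks. The sum does not
         * depend on byte order, so this works on either endianness. */
        sum += (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
    }
    for (; i < n; i++) /* leftover elements use narrow loads */
        sum += p[i];
    return sum;
}
```

The extra shift and mask instructions needed to unpack the wide value are one reason the benefit can depend on the instruction set: where field extraction is expensive, the coalesced loop may be slower than the original.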