Memory access coalescing: a technique for eliminating redundant memory accesses

  • Authors:
  • Jack W. Davidson; Sanjay Jinturkar

  • Affiliations:
  • Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA, U.S.A.; Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA, U.S.A.

  • Venue:
  • PLDI '94: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation
  • Year:
  • 1994


Abstract

As microprocessor speeds increase, memory bandwidth is increasingly the performance bottleneck. This has occurred because innovation and technological improvements in processor design have outpaced advances in memory design. Most attempts at addressing this problem have involved hardware solutions. Unfortunately, these solutions do little for current microprocessors. In previous work, we developed, implemented, and evaluated an algorithm that exploited the ability of newer machines with wide buses to load/store multiple floating-point operands in a single memory reference. This paper describes a general code improvement algorithm that transforms code to better exploit the available memory bandwidth on existing microprocessors as well as wide-bus machines. Where possible and advantageous, the algorithm coalesces narrow memory references into wide ones. An interesting characteristic of the algorithm is that some decisions about the applicability of the transformation are made at run time. This dynamic analysis significantly increases the probability of the transformation being applied. The code improvement transformation was implemented and added to the repertoire of code improvements of an existing retargetable optimizing back end. Using three current architectures as evaluation platforms, the effectiveness of the transformation was measured on a set of compute- and memory-intensive programs. Interestingly, the effectiveness of the transformation varied significantly with the instruction-set architecture of the tested platform. For one of the tested architectures, improvements in execution speed ranging from 5 to 40 percent were observed. For another, the improvements in execution speed ranged from 5 to 20 percent, while for yet another, the transformation resulted in slower code for all programs.
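
The abstract only outlines the transformation, and the paper's algorithm operates on the intermediate code of a retargetable optimizing back end rather than on source code. As a rough source-level illustration only, the sketch below shows the effect being described: two adjacent narrow memory references coalesced into one wide reference, guarded by a run-time check that decides whether the wide form may be used. The function names, the pair-copy loop, and the 8-byte alignment test are illustrative assumptions, not the paper's actual algorithm or benchmarks.

```c
#include <stdint.h>
#include <stddef.h>

/* Original loop: two adjacent 32-bit loads and two adjacent 32-bit
 * stores per iteration. */
void copy_pairs(int32_t *dst, const int32_t *src, size_t n_pairs)
{
    for (size_t i = 0; i < n_pairs; i++) {
        dst[2 * i]     = src[2 * i];
        dst[2 * i + 1] = src[2 * i + 1];
    }
}

/* Coalesced version: when a run-time check shows both pointers are
 * 8-byte aligned, each pair of 32-bit references is replaced by a
 * single 64-bit reference, halving the number of memory accesses.
 * Otherwise the original narrow references are used.  (A compiler
 * performs this at the instruction level; the source-level casts
 * here are purely for illustration and would violate C's strict
 * aliasing rules in portable code.) */
void copy_pairs_coalesced(int32_t *dst, const int32_t *src, size_t n_pairs)
{
    if (((uintptr_t)dst % 8 == 0) && ((uintptr_t)src % 8 == 0)) {
        uint64_t *wd       = (uint64_t *)dst;
        const uint64_t *ws = (const uint64_t *)src;
        for (size_t i = 0; i < n_pairs; i++)
            wd[i] = ws[i];          /* one wide access per pair */
    } else {
        for (size_t i = 0; i < n_pairs; i++) {
            dst[2 * i]     = src[2 * i];
            dst[2 * i + 1] = src[2 * i + 1];
        }
    }
}
```

The run-time alignment test corresponds to the abstract's point that some applicability decisions are deferred to execution time: when static analysis cannot prove the references may be widened, the transformed code can still select the coalesced path dynamically, which is why the dynamic analysis increases how often the transformation applies.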