Memory access coalescing: a technique for eliminating redundant memory accesses

  • Authors:
  • Jack W. Davidson; Sanjay Jinturkar

  • Affiliations:
  • Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA, U.S.A.; Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA, U.S.A.

  • Venue:
  • PLDI '94: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation
  • Year:
  • 1994


Abstract

As microprocessor speeds increase, memory bandwidth is increasingly the performance bottleneck. This has occurred because innovation and technological improvements in processor design have outpaced advances in memory design. Most attempts at addressing this problem have involved hardware solutions. Unfortunately, these solutions do little for current microprocessors. In previous work, we developed, implemented, and evaluated an algorithm that exploited the ability of newer machines with wide buses to load/store multiple floating-point operands in a single memory reference. This paper describes a general code improvement algorithm that transforms code to better exploit the available memory bandwidth on existing microprocessors as well as wide-bus machines. Where possible and advantageous, the algorithm coalesces narrow memory references into wide ones. An interesting characteristic of the algorithm is that some decisions about the applicability of the transformation are made at run time. This dynamic analysis significantly increases the probability of the transformation being applied. The code improvement transformation was implemented and added to the repertoire of code improvements of an existing retargetable optimizing back end. Using three current architectures as evaluation platforms, the effectiveness of the transformation was measured on a set of compute- and memory-intensive programs. Interestingly, the effectiveness of the transformation varied significantly with the instruction-set architecture of the tested platform. For one of the tested architectures, improvements in execution speed ranging from 5 to 40 percent were observed. For another, the improvements in execution speed ranged from 5 to 20 percent, while for yet another, the transformation resulted in slower code for all programs.
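
The abstract only outlines the transformation, and the paper's algorithm operates on the intermediate code of a retargetable optimizing back end rather than on source code. As a rough source-level illustration only, the sketch below shows the effect being described: two adjacent narrow memory references coalesced into one wide reference, guarded by a run-time check that decides whether the wide form may be used. The function names, the pair-copy loop, and the 8-byte alignment test are illustrative assumptions, not the paper's actual algorithm or benchmarks.

```c
#include <stdint.h>
#include <stddef.h>

/* Original loop: two adjacent 32-bit loads and two adjacent 32-bit
 * stores per iteration. */
void copy_pairs(int32_t *dst, const int32_t *src, size_t n_pairs)
{
    for (size_t i = 0; i < n_pairs; i++) {
        dst[2 * i]     = src[2 * i];
        dst[2 * i + 1] = src[2 * i + 1];
    }
}

/* Coalesced version: when a run-time check shows both pointers are
 * 8-byte aligned, each pair of 32-bit references is replaced by a
 * single 64-bit reference, halving the number of memory accesses.
 * Otherwise the original narrow references are used.  (A compiler
 * performs this at the instruction level; the source-level casts
 * here are purely for illustration and would violate C's strict
 * aliasing rules in portable code.) */
void copy_pairs_coalesced(int32_t *dst, const int32_t *src, size_t n_pairs)
{
    if (((uintptr_t)dst % 8 == 0) && ((uintptr_t)src % 8 == 0)) {
        uint64_t *wd       = (uint64_t *)dst;
        const uint64_t *ws = (const uint64_t *)src;
        for (size_t i = 0; i < n_pairs; i++)
            wd[i] = ws[i];          /* one wide access per pair */
    } else {
        for (size_t i = 0; i < n_pairs; i++) {
            dst[2 * i]     = src[2 * i];
            dst[2 * i + 1] = src[2 * i + 1];
        }
    }
}
```

The run-time alignment test corresponds to the abstract's point that some applicability decisions are deferred to execution time: when static analysis cannot prove the references may be widened, the transformed code can still select the coalesced path dynamically, which is why the dynamic analysis increases how often the transformation applies.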