Instruction combining for coalescing memory accesses using global code motion

  • Authors:
  • Motohiro Kawahito, Hideaki Komatsu, Toshio Nakatani

  • Affiliations:
  • IBM Tokyo Research Laboratory, Shimotsuruma, Yamato, Kanagawa, Japan (all authors)

  • Venue:
  • MSP '04 Proceedings of the 2004 workshop on Memory system performance
  • Year:
  • 2004


Abstract

Instruction combining is an optimization that replaces a sequence of instructions with a more efficient instruction yielding the same result in fewer machine cycles. When applied to coalescing memory accesses, it can reduce memory traffic by combining narrow memory references with contiguous addresses into a single wider reference, taking advantage of a wide-bus architecture. Coalescing memory accesses can improve performance for two reasons: it reduces the additional cycles required for moving data from caches to registers, and it reduces the stall cycles caused by multiple outstanding memory access requests. Previous approaches to memory access coalescing focus only on array access instructions related to loop induction variables, and thus miss many other opportunities. In this paper, we propose a new algorithm for instruction combining that applies global code motion to wider regions of the given program in search of more potential candidates. We implemented two optimizations for coalescing memory accesses using our algorithm in the IBM Java™ JIT compiler for IA-64, one combining two 32-bit integer loads and the other combining two single-precision floating-point loads, and evaluated them on the SPECjvm98 benchmark suite. In our experiments, we improved the maximum performance by 5.5% with little additional compilation time overhead. Moreover, when we replaced every declaration of double for an instance variable with float, we improved the performance of the MolDyn benchmark in the JavaGrande benchmark suite by 7.3%. Our approach can be applied to a variety of architectures and to programming languages besides Java.