Tolerating latency in multiprocessors through compiler-inserted prefetching

  • Authors:
  • Todd C. Mowry

  • Affiliations:
  • Carnegie Mellon Univ., Pittsburgh, PA

  • Venue:
  • ACM Transactions on Computer Systems (TOCS)
  • Year:
  • 1998

Quantified Score

Hi-index 0.00

Visualization

Abstract

The large latency of memory accesses in large-scale shared-memory multiprocessors is a key obstacle to achieving high processor utilization. Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing instructions to move data close to the processor before the data are actually needed. To minimize the burden on the programmer, compiler support is needed to automatically insert prefetch instructions into the code. A key challenge when inserting prefetches is ensuring that the overheads of prefetching do not outweigh the benefits. While previous studies have demonstrated the effectiveness of hand-inserted prefetching in multiprocessor applications, the benefit of compiler-inserted prefetching in practice has remained an open question. This article proposes and evaluates a new compiler algorithm for inserting prefetches into multiprocessor code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. We have implemented our algorithm in the SUIF(Stanford University Intermediate Format) optimizing compiler. The results of our detailed architectural simulations demonstrate that compiler-inserted prefetching can improve the speed of some parallel applications by as much as a factor of two.