Compiler Optimizations for Cache Locality and Coherence

  • Authors: Wei Li
  • Venue: Compiler Optimizations for Cache Locality and Coherence
  • Year: 1994

Abstract

Almost every modern processor is designed with a memory hierarchy organized into several levels, each of which is smaller, faster, and more expensive than the level below. High performance requires effective use of cached data, i.e., cache locality. Smart compiler transformations can relieve the programmer of hand-optimizing for specific machine architectures.

In a multiprocessor system, data inconsistency may occur between memory and caches. For example, the memory and multiple caches may hold inconsistent copies of the same cache block. This introduces the problem of cache coherence, and several cache coherence protocols have been developed to maintain data coherence across multiple processors. Because multiple variables can reside in the same cache block, processors that access different variables in that block may still contend for it. This is the problem of false sharing, which many researchers have identified as a major obstacle to high performance. Therefore, in a multiprocessor system, we need to avoid false sharing as well as exploit cache locality.

In this paper, we first develop a new data reuse model and an algorithm called height reduction to improve cache locality. The advantage of this algorithm is that it can improve band matrix programs as well as dense matrix programs. It is more accurate and more general than existing techniques for improving cache locality, which were developed to optimize dense matrix programs. Then, using the height reduction algorithm, we extend loop tiling to exploit not only intra-tile data locality but also inter-tile data locality. We call the new tiling affinity tiling. Our experiments show that affinity tiling is less sensitive to the choice of tile size. Finally, we show that the algorithm also helps to eliminate or reduce false sharing in multiprocessor systems. With the height reduction algorithm and affinity tiling, we observed significant performance improvements (speedups from 2.5 to 10) on HP workstations and KSR1 multiprocessors.
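For readers unfamiliar with the baseline techniques the abstract builds on, the following is a minimal sketch of conventional loop tiling and of padding to avoid false sharing. It is not the height reduction algorithm or affinity tiling, neither of which the abstract specifies. The names `matmul_naive`, `matmul_tiled`, and `padded_counter`, the dimension `N`, the tile size `TILE`, and the 64-byte `CACHE_LINE` are all assumptions chosen for illustration.

```c
#include <stddef.h>

#define N 64          /* matrix dimension (illustrative assumption) */
#define TILE 16       /* tile size (assumption; tune to the cache) */
#define CACHE_LINE 64 /* cache block size in bytes (assumption) */

/* Untiled i-j-k matrix multiply: c = a * b.
 * The inner loop walks b column-wise, so each b[k][j]
 * tends to touch a different cache block. */
void matmul_naive(double a[N][N], double b[N][N], double c[N][N]) {
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Conventional tiled multiply: the iteration space is split
 * into TILE x TILE x TILE blocks so that the working set of one
 * block can stay in cache while it is reused (intra-tile locality). */
void matmul_tiled(double a[N][N], double b[N][N], double c[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            c[i][j] = 0.0;

    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                /* Work inside one tile; min_sz handles N % TILE != 0. */
                for (size_t i = ii; i < min_sz(ii + TILE, N); i++)
                    for (size_t k = kk; k < min_sz(kk + TILE, N); k++) {
                        double aik = a[i][k];
                        for (size_t j = jj; j < min_sz(jj + TILE, N); j++)
                            c[i][j] += aik * b[k][j];
                    }
}

/* Padding so that the value fields of adjacent array elements are
 * CACHE_LINE bytes apart and therefore never share a cache block.
 * Without the pad, per-processor counters packed in one block
 * would exhibit false sharing. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};
```

Both multiply routines compute the same product; only the order in which memory is visited differs. The performance of a conventional tiling like this one usually depends on how well `TILE` matches the cache, which is the sensitivity the abstract reports affinity tiling reduces.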