Reducing the latency of L2 misses in shared-memory multiprocessors through on-chip directory integration

  • Authors:
  • Manuel E. Acacio;José González;José M. García;José Duato

  • Affiliations:
  • Dpto. Ing. y Tecnología de Computadores, Universidad de Murcia, Murcia, Spain;Dpto. Ing. y Tecnología de Computadores, Universidad de Murcia, Murcia, Spain;Dpto. Ing. y Tecnología de Computadores, Universidad de Murcia, Murcia, Spain;Dpto. Inf. de Sistemas y Computadores, Universidad Politécnica de Valencia, Valencia, Spain

  • Venue:
  • EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
  • Year:
  • 2002

Abstract

Recent technology improvements allow multiprocessor designers to place key components, such as the memory controller and the network interface, inside the processor chip. In this work we exploit this scale of integration and present a new three-level directory architecture aimed at reducing the long L2 miss latencies and the memory overhead that characterize cc-NUMA machines and limit their scalability. To reduce miss latencies, the proposed architecture integrates into the processor chip the directory controller and a small first-level directory cache that stores precise information for the most recently referenced memory lines. The second- and third-level directories are located near main memory and are accessed only when the directory entry for a memory line is not present in the first-level directory. This off-chip structure achieves the performance of a large, non-scalable full-map directory with a very significant reduction in memory overhead. Using execution-driven simulations, we show that the proposed directory architecture yields substantial latency reductions. Load, store and read-modify-write misses are significantly accelerated (latency reductions of more than 35% in some cases). These reductions translate into important improvements in final application performance (reductions of up to 20% in execution time).
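
The lookup order described in the abstract can be illustrated with a minimal Python sketch. It is a toy model, not the authors' design: the class name `ThreeLevelDirectory`, the LRU replacement in the first level, the representation of a directory entry as a set of sharer IDs, and the policy of writing evicted first-level entries into the second level are all assumptions made for illustration. The abstract does not specify how the second- and third-level directories are organized, so they are modeled here as plain dictionaries.

```python
from collections import OrderedDict


class ThreeLevelDirectory:
    """Toy model of a small on-chip first-level directory cache backed by
    off-chip second- and third-level directories (hypothetical structure)."""

    def __init__(self, l1_capacity):
        self.l1_capacity = l1_capacity
        self.l1 = OrderedDict()  # on-chip: most recently referenced lines
        self.l2 = {}             # off-chip: organization is an assumption
        self.l3 = {}             # off-chip: organization is an assumption

    def lookup(self, line):
        """Return (entry, level), where level is the directory that served it."""
        if line in self.l1:
            # Hit on chip: no off-chip directory access is needed.
            self.l1.move_to_end(line)
            return self.l1[line], 1
        if line in self.l2:
            entry, level = self.l2[line], 2
        else:
            # Assumption: the third level always has (or creates) an entry.
            entry, level = self.l3.setdefault(line, set()), 3
        self._install(line, entry)
        return entry, level

    def _install(self, line, entry):
        """Place the entry in the first level, evicting the LRU line if full."""
        self.l1[line] = entry
        self.l1.move_to_end(line)
        if len(self.l1) > self.l1_capacity:
            victim, victim_entry = self.l1.popitem(last=False)
            # Assumed policy: evicted entries are kept in the second level.
            self.l2[victim] = victim_entry


if __name__ == "__main__":
    directory = ThreeLevelDirectory(l1_capacity=2)
    for address in (0x10, 0x10, 0x20, 0x30, 0x10):
        _, level = directory.lookup(address)
        print(hex(address), "served by directory level", level)
```

In this model, a repeated reference to a recently used line is served by the first level, while a line that has been evicted from the first level is served by an off-chip level, which is the case the on-chip directory cache is meant to make rare.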