Compiler Support for Data Forwarding in Scalable Shared-Memory Multiprocessors

Authors:
David Koufaty;Josep Torrellas
Affiliations:
-;-
Venue:
ICPP '99 Proceedings of the 1999 International Conference on Parallel Processing
Year:
1999

Citing 0
Cited 3

System-wide performance monitors and their application to the optimization of coherent memory accesses

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimizing embedded applications using programmer-inserted hints

Proceedings of the 2005 Asia and South Pacific Design Automation Conference
Compiler optimization techniques for OpenMP programs

Scientific Programming

Abstract

As the difference in speed between processor and memory system continues to increase, it is becoming crucial to develop and refine techniques that enhance the effectiveness of cache hierarchies. One promising technique in the context of scalable shared-memory multiprocessors is data forwarding. Forwarding hides the latency of communication-induced misses by having producer processors send data to the caches of potential consumer processors in advance. Forwarding can hide the latency effectively, has low instruction overhead, and uses few machine resources.

This paper presents a complete implementation of a data forwarding pass in an industrial-strength parallelizing compiler. Complete Fortran applications are analyzed for dependences and, based on the analysis, automatically annotated with forwarding directives. We propose a forwarding framework that includes 4 new instructions: write-forward, write-broadcast, write-update, and write-through. New micro-architectural support is proposed.

In our analysis, we assume that the assignment of loop iterations to processors is known. We perform simulations of multiprocessors with different cache, memory, machine sharing, and process migration parameters. We conclude that data forwarding delivers large speedups (six 32-processor applications ran an average of 40% faster), gets close to the upper bound in performance, and needs compiler support of only medium complexity.
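
As a rough illustration of the forwarding idea described in the abstract, the following toy model counts communication-induced misses for one producer and one consumer under an invalidation-based protocol. It is only a sketch under stated assumptions, not the paper's compiler pass or simulator: the function name `count_consumer_misses`, its parameters, and the simplified "forwarding write" semantics (the producer's write leaves a valid copy in the consumer's cache) are hypothetical and chosen for illustration.

```python
def count_consumer_misses(num_phases, num_lines, forward):
    """Toy model: one producer, one consumer, `num_lines` shared cache lines.

    Assumption (not from the paper): a plain write invalidates the consumer's
    copy, while a forwarding write also pushes the new data into the
    consumer's cache before the consumer reads it.
    """
    consumer_cache = set()  # lines currently valid in the consumer's cache
    misses = 0
    for _ in range(num_phases):
        # Producer phase: write every shared line.
        for line in range(num_lines):
            consumer_cache.discard(line)      # invalidate the stale copy
            if forward:
                consumer_cache.add(line)      # hypothetical forwarding write
        # Consumer phase: read every line the producer just wrote.
        for line in range(num_lines):
            if line not in consumer_cache:
                misses += 1                   # communication-induced miss
                consumer_cache.add(line)      # line is fetched on the miss
    return misses


if __name__ == "__main__":
    print("without forwarding:", count_consumer_misses(10, 8, forward=False))
    print("with forwarding:   ", count_consumer_misses(10, 8, forward=True))
```

In this simplified setting every consumer read misses without forwarding and none does with it. A real system would also have to account for the cost of the forwarding traffic itself, finite cache capacity, and consumers that cannot be predicted at compile time.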