Data prefetching via speculative precomputation on a simultaneous multithreaded processor

Authors:
Jamison D. Collins;Dean Tullsen
Affiliations:
-;-
Venue:
Data prefetching via speculative precomputation on a simultaneous multithreaded processor
Year:
2004

Abstract

This dissertation explores Speculative Precomputation (SP), a new, heavyweight prefetching technique aimed at improving the performance of those loads in a program that are most difficult to attack through traditional techniques. Historically, memory latency has been attacked in one of three ways: keeping important data near the processor (caches), bringing distant data to the processor ahead of time (prefetching), or doing something else while waiting for the data (multithreading). Despite advances in these areas, a small fraction of static loads, which we call delinquent loads, continues to exhibit poor memory behavior. Because they access large regions of memory and do so in unpredictable patterns, delinquent loads foil both caches and traditional prefetching schemes. Rather than predicting future memory accesses for delinquent loads (a difficult task), SP precomputes these accesses by speculatively executing slices of main-thread instructions on otherwise idle multithreading hardware. Because a slice executes a subset of the main thread's own instructions, even unpredictable accesses are computed accurately. Such an approach is very general, capable of targeting a wide range of instructions. It is also very specific, as careful selection of the instructions to be speculatively executed ensures precisely tuned prefetching behavior for each targeted load. This dissertation presents two forms of Speculative Precomputation. The first is a software-based scheme, in which all necessary program and instruction analysis is carried out offline, via compiler and profiling analysis. We also explore a second, complementary approach, which instead performs all analysis dynamically by adding back-end instruction analysis hardware to the processor. We found that both approaches, each representing an extreme among SP implementations, yield significant performance gains.
Extending our dynamic approach with knowledge of a program's control reconvergence behavior brings the hardware scheme's program knowledge closer to parity with that of the software-based scheme, allowing more aggressive and effective slices to be constructed. We therefore also propose a novel technique that predicts reconvergence behavior for a program's branches with high accuracy. In addition to extending our hardware-based SP approach, dynamic reconvergence prediction has applications in speculative multithreading, instruction reuse, and reducing the number of wrong-path instructions fetched.
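
As a rough illustration of what a "slice" of main-thread instructions might look like, the sketch below walks backward over a toy instruction trace and collects only the instructions that feed the address of a targeted load. It is not the dissertation's slice-construction algorithm: the `Instr` record, the register names, the trace contents, and the `backward_slice` helper are all hypothetical, and the sketch assumes register dependences only (dependences through memory and control flow are ignored).

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination register, several sources."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def backward_slice(trace: List[Instr], load_index: int) -> Tuple[List[int], Set[str]]:
    """Collect the instructions that the load at load_index depends on.

    Returns the slice as trace indices in program order, plus the live-in
    registers that must be copied from the main thread before the slice runs.
    Assumption: only register dependences are tracked.
    """
    needed = set(trace[load_index].srcs)  # registers still awaiting a producer
    picked = [load_index]
    for i in range(load_index - 1, -1, -1):
        ins = trace[i]
        if ins.dest is not None and ins.dest in needed:
            needed.discard(ins.dest)   # this instruction produces the value
            needed.update(ins.srcs)    # ...so its own inputs are now needed
            picked.append(i)
    picked.reverse()
    return picked, needed


# Toy trace: a pointer-chasing pattern with unrelated arithmetic mixed in.
trace = [
    Instr("add",  "r1",  ("r1",)),        # 0: advance pointer
    Instr("mul",  "r5",  ("r6", "r7")),   # 1: unrelated work
    Instr("load", "r2",  ("r1",)),        # 2: load an index
    Instr("add",  "r8",  ("r5", "r9")),   # 3: unrelated work
    Instr("shl",  "r3",  ("r2",)),        # 4: scale the index
    Instr("add",  "r4",  ("r3", "r10")),  # 5: add the base address
    Instr("load", "r11", ("r4",)),        # 6: the targeted (delinquent) load
]

slice_indices, live_ins = backward_slice(trace, 6)
print(slice_indices, sorted(live_ins))
```

In this toy trace the slice omits instructions 1 and 3, which is the sense in which a slice is a subset of the main thread's work: a spare hardware context could run just the selected instructions to reach the load's address ahead of the main thread.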