Reducing the latency of load instructions is among the most important factors in achieving performance on current and future microarchitectures. Deep pipelining makes L1 caches appear farther than one cycle away, increasing load-to-use latency even when loads hit in the cache. In this paper we present a novel dynamic mechanism aimed at overcoming load-to-use latency. Our mechanism dynamically detects relations between address-producing instructions and the loads that consume those addresses, and uses this information to access the data before the load is even fetched from the I-cache. We modify the renaming stage so that when these loads are fetched, they are detected and squashed, since their work has already been done. By fetching data ahead of time, our mechanism allows the microarchitecture to see further into the future, a concept akin to having a bigger reorder buffer. The mechanism is not intended to prefetch from outside the chip (main memory, or the L3 cache if present); its main aim is to move data silently and ahead of time from the L1 and L2 caches into the register file, so that the load instruction can subsequently be bypassed (hence the name). The mechanism's benefits increase in the presence of memory prefetching or good memory behaviour, since these scenarios allow more loads to be bypassed. Moreover, a better use of renaming registers allows our mechanism to outperform the baseline even when the latter has more renaming registers. An average performance improvement of 24.5% is achieved on the SPECint95 benchmarks.
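The core idea in the abstract — linking an address-producing instruction to the load that consumes its result, accessing the cache early when the producer writes back, and later squashing the redundant load at rename — can be sketched in a few lines. The sketch below is purely illustrative: the table, class, and field names (`BypassTable`, `learn`, `on_producer_writeback`) are assumptions of this note, not structures described in the paper.

```python
# Illustrative sketch of producer-to-load dependence linking, assuming a
# simple direct table indexed by the producer's PC. Not the paper's design.

class Cache:
    """Toy cache: a dict from address to data."""
    def __init__(self, memory):
        self.memory = memory

    def read(self, addr):
        return self.memory.get(addr, 0)


class BypassTable:
    def __init__(self):
        # producer PC -> (dependent load PC, constant address offset)
        self.links = {}

    def learn(self, producer_pc, load_pc, offset):
        """Record that the load at `load_pc` uses the value produced at
        `producer_pc` (plus `offset`) as its effective address."""
        self.links[producer_pc] = (load_pc, offset)

    def on_producer_writeback(self, producer_pc, value, cache):
        """The producer's result is known: access the cache early and
        return (load_pc, data), so the data can be placed in a renaming
        register before the load is even fetched. Returns None when the
        producer has no linked load."""
        if producer_pc not in self.links:
            return None
        load_pc, offset = self.links[producer_pc]
        return load_pc, cache.read(value + offset)


# Usage: the instruction at PC 0x10 computes a base address that the
# load at PC 0x14 consumes with a constant offset of 8.
mem = {0x1008: 42}
table = BypassTable()
table.learn(0x10, 0x14, 8)
hit = table.on_producer_writeback(0x10, 0x1000, Cache(mem))
# hit is (0x14, 42): when the load at 0x14 is later fetched, the rename
# stage can map its destination to the register already holding 42 and
# squash the load, since its work has already been done.
```

In a real pipeline the "learn" step would be driven by observed dependences, and the squash at rename corresponds to the modified renaming stage the abstract describes; the constant-offset assumption is a simplification for this sketch.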