With the widening performance gap between the processor and memory, caches are becoming ever more important for high-performance processors. However, as feature sizes shrink and clock speeds rise, cache access latencies are growing. Designers pipeline cache accesses to keep these growing latencies from limiting cache throughput. Nevertheless, longer latencies can still degrade performance significantly by delaying the execution of dependent instructions.

In this paper, we investigate predicting the data cache set and the tag of the memory address as a means of reducing the effective cache access latency. In this technique, the predicted set is used to start the pipelined cache access in parallel with the memory address computation. We also propose a set-address adaptive predictor to improve the prediction accuracy for data cache sets. Our studies found that using set prediction to reduce the load-to-use latency can improve overall processor performance by as much as 24%. We also investigate techniques, such as predicting the data cache line where the data will be present, to limit the increase in cache energy consumption incurred by set prediction. In fact, with line prediction, the techniques in this paper consume about 15% less energy in the data cache than a decoupled-accessed cache with minimum energy consumption, while still maintaining the performance improvement. However, when the energy consumed in the predictor table is also considered, the overall energy consumption is about 35% more than that of a decoupled-accessed cache.
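As a rough illustration of the set-prediction idea described above, the C sketch below models a small PC-indexed predictor table that supplies a guessed cache set before the load's address has been computed, so the set decode can begin in parallel with address generation; once the real address is available, the prediction is verified and the table retrained. All names and sizes here (predict_set, verify_and_train, a 1024-entry table, 128 sets, 64-byte lines, last-value update) are illustrative assumptions and do not reproduce the paper's set-address adaptive predictor.

/* Minimal sketch of PC-indexed data-cache set prediction.
 * Parameters and update policy are assumptions for illustration only. */
#include <stdint.h>
#include <stdio.h>

#define PRED_ENTRIES 1024   /* predictor table entries (assumption) */
#define CACHE_SETS   128    /* data-cache sets (assumption)         */
#define LINE_BYTES   64     /* cache line size in bytes (assumption)*/

static uint16_t set_pred[PRED_ENTRIES];  /* last set seen per load PC */

/* Predict the cache set for a load before its address is available. */
static unsigned predict_set(uint64_t load_pc)
{
    return set_pred[(load_pc >> 2) % PRED_ENTRIES];
}

/* After address computation, check the prediction and train the table.
 * A misprediction would restart the cache access with the true set. */
static int verify_and_train(uint64_t load_pc, uint64_t addr)
{
    unsigned true_set = (unsigned)((addr / LINE_BYTES) % CACHE_SETS);
    unsigned idx      = (unsigned)((load_pc >> 2) % PRED_ENTRIES);
    int correct       = (set_pred[idx] == true_set);
    set_pred[idx]     = (uint16_t)true_set;  /* simple last-value update */
    return correct;
}

int main(void)
{
    uint64_t pc = 0x400100, addr = 0x7fff0040;
    unsigned guess = predict_set(pc);
    printf("predicted set %u, %s\n", guess,
           verify_and_train(pc, addr) ? "correct" : "mispredicted");
    return 0;
}

A full design would also verify the predicted tag and recover from a set misprediction by reissuing the access with the computed address, which is where the performance and energy trade-offs discussed above arise.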