On reducing load/store latencies of cache accesses

Authors:
Yuan-Shin Hwang; Jia-Jhe Li
Affiliations:
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan; Department of Computer Science, National Tsing Hua University, Hsinchu 300, Taiwan
Venue:
Journal of Systems Architecture: the EUROMICRO Journal
Year:
2010

Abstract

Effective address calculations for load and store instructions must compete with other instructions for the ALU, which can add latency to data cache accesses. Fast address generation has been proposed as a way to reduce this latency. This paper presents a fast address generator that eliminates most effective address computations by storing the computed effective addresses of previous load/store instructions in a dummy register file. Experimental results show that the generator reduces the effective address computations of load and store instructions by about 74% on average on the SPECint2000 benchmarks and cuts execution time by 8.5%. Furthermore, when multiple dummy register files are deployed, it eliminates over 90% of these effective address computations and improves average execution time by 9.3%.
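
The abstract does not describe how the dummy register file is organized, so the following is only a toy sketch of the general idea: reuse a previously computed effective address instead of sending base + offset through the ALU again. The one-entry-per-base-register layout, the invalidate-on-write policy, and all names (`DummyRegisterFile`, `Entry`, `effective_address`) are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Entry:
    """Cached result of an earlier load/store address computation."""
    offset: int
    address: int


class DummyRegisterFile:
    """Hypothetical model: one cached effective address per base register."""

    def __init__(self) -> None:
        self.entries: Dict[int, Entry] = {}
        self.hits = 0               # address computations avoided
        self.alu_computations = 0   # address computations sent to the ALU

    def invalidate(self, reg: int) -> None:
        """Drop the cached address when the base register is overwritten."""
        self.entries.pop(reg, None)

    def effective_address(self, base_reg: int, base_value: int, offset: int) -> int:
        """Return base + offset, reusing a cached address when one matches."""
        entry = self.entries.get(base_reg)
        if entry is not None and entry.offset == offset:
            self.hits += 1
            return entry.address
        # Miss: compute the address on the ALU and remember it.
        self.alu_computations += 1
        address = base_value + offset
        self.entries[base_reg] = Entry(offset, address)
        return address


# Tiny trace: two accesses through r1 with the same offset, then r1 changes.
drf = DummyRegisterFile()
regs = {1: 0x1000}
a1 = drf.effective_address(1, regs[1], 8)  # miss, computed on the ALU
a2 = drf.effective_address(1, regs[1], 8)  # hit, reused from the cache
regs[1] = 0x2000
drf.invalidate(1)                          # base register was overwritten
a3 = drf.effective_address(1, regs[1], 8)  # miss again after invalidation
```

In this toy trace one of the three address computations is avoided. The percentages quoted in the abstract come from the authors' SPECint2000 experiments, not from a model like this one.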