Streamlining data cache access with fast address calculation
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Zero-cycle loads: microarchitecture support for reducing load latency
Proceedings of the 28th annual international symposium on Microarchitecture
Microarchitecture support for improving the performance of load target prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Multilevel Optimization of Pipelined Caches
IEEE Transactions on Computers
Tolerating latency in multiprocessors through compiler-inserted prefetching
ACM Transactions on Computer Systems (TOCS)
Correlated load-address predictors
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Early load address resolution via register tracking
Proceedings of the 27th annual international symposium on Computer architecture
Architectural and compiler support for effective instruction prefetching: a cooperative approach
ACM Transactions on Computer Systems (TOCS)
The Alpha 21264 Microprocessor
IEEE Micro
Pointer cache assisted prefetching
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Dynamic memory instruction bypassing
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Predictive sequential associative cache
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Reducing data cache energy consumption via cached load/store queue
Proceedings of the 2003 international symposium on Low power electronics and design
On load latency in low-power caches
Proceedings of the 2003 international symposium on Low power electronics and design
Microprocessor pipeline energy analysis
Proceedings of the 2003 international symposium on Low power electronics and design
Reducing Design Complexity of the Load/Store Queue
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
A study of source-level compiler algorithms for automatic construction of pre-execution code
ACM Transactions on Computer Systems (TOCS)
Signature Buffer: Bridging Performance Gap between Registers and Caches
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Snug set-associative caches: reducing leakage power while improving performance
ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
Reducing latencies of pipelined cache accesses through set prediction
Proceedings of the 19th annual international conference on Supercomputing
Scalable Store-Load Forwarding via Store Queue Index Prediction
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Decomposing the load-store queue by function for power reduction and scalability
IBM Journal of Research and Development
Computer Architecture, Fourth Edition: A Quantitative Approach
Computer Architecture, Fourth Edition: A Quantitative Approach
ACM Transactions on Architecture and Code Optimization (TACO)
Reducing non-deterministic loads in low-power caches via early cache set resolution
Microprocessors & Microsystems
Reducing cache misses through programmable decoders
ACM Transactions on Architecture and Code Optimization (TACO)
Computer Organization and Design, Fourth Edition, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design)
Hi-index | 0.01 |
Effective address calculations for load and store instructions need to compete for ALU with other instructions and hence extra latencies might be incurred to data cache accesses. Fast address generation is an approach proposed to reduce cache access latencies. This paper presents a fast address generator that can eliminate most of the effective address computations by storing computed effective addresses of previous load/store instructions in a dummy register file. Experimental results show that this fast address generator can reduce effective address computations of load and store instructions by about 74% on average for SPECint2000 benchmarks and cut the execution times by 8.5%. Furthermore, when multiple dummy register files are deployed, this fast address generator eliminates over 90% of effective address computations of load and store instructions and improves the average execution times by 9.3%.