Two-level adaptive training branch prediction
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Supporting dynamic data structures on distributed-memory machines
ACM Transactions on Programming Languages and Systems (TOPLAS)
SPAID: software prefetching in pointer- and call-intensive environments
Proceedings of the 28th annual international symposium on Microarchitecture
Compiler-based prefetching for recursive data structures
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Dependence based prefetching for linked data structures
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Correlated load-address predictors
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Effective jump-pointer prefetching for linked data structures
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Push vs. pull: data movement for linked data structures
Proceedings of the 14th international conference on Supercomputing
Using a user-level memory thread for correlation prefetching
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A stateless, content-directed data prefetching mechanism
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Pointer cache assisted prefetching
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A study of branch prediction strategies
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Impulse: Building a Smarter Memory Controller
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
FlexRAM: Toward an Advanced Intelligent Memory System
ICCD '99 Proceedings of the 1999 IEEE International Conference on Computer Design
Hi-index | 0.00 |
Indirect memory accesses, where a load is fed by another load, are ubiquitous because of rich data structures and sophisticated software conventions, such as the use of linkage tables and position independent code. Unfortunately, they can be costly: if both loads miss, two round trips to memory are required even though the role of the first load is often limited to fetching the address of the second load. To reduce the total latency of such indirect accesses, a new instruction called load squared is introduced. A load squared does two fetches, the first fetch reading the target address of the second. (An offset is optionally added to the result of the first fetch.) The load squared operation is performed by memory-side logic (typically, the memory controller if it isn't located on the main processor chip). In this study, load squared is not an architecturally visible instruction: the micro-architecture transparently decides which loads should be replaced by loads squared. We show that performance is sometimes improved significantly, and never degraded.