In this paper, we introduce a simple hardware mechanism that supports non-blocking loads in conjunction with lockup-free caches to hide memory latency in high-performance processors. The cache and processor cooperate on load misses so that the overall complexity of the non-blocking mechanisms in the cache and in the processor is greatly reduced. We use detailed simulations to evaluate the effectiveness of the architecture, and of a simple compiler transformation, at hiding miss latencies of up to 200 processor cycles. For a given program, we identify a critical latency. For latencies below this critical latency, the non-blocking processor/cache architecture achieves perfect memory latency tolerance by overlapping misses with processor execution. For higher latencies, significant improvements in processor efficiency are still obtained by overlapping multiple misses with one another. A simple model is used to illustrate this effect, and improvements are proposed based on the results.
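The critical-latency effect can be sketched with a toy efficiency model. This is an illustrative assumption, not the model actually used in the paper: it supposes that each load miss overlaps with up to a fixed number of cycles of independent work, so only the remainder of the miss latency stalls the processor. The parameter names and default values (`miss_rate_per_instr`, `cpi_base`) are hypothetical.

```python
def efficiency(miss_latency, overlap_cycles, miss_rate_per_instr=0.02, cpi_base=1.0):
    """Toy model of processor efficiency under non-blocking loads.

    Each miss is assumed to overlap with up to `overlap_cycles` of
    independent execution; only the unhidden remainder stalls the CPU.
    Efficiency is the ratio of base cycles to total cycles per instruction.
    """
    # Stall per miss: whatever part of the latency cannot be hidden.
    stall_per_miss = max(0, miss_latency - overlap_cycles)
    # Average cycles per instruction including exposed miss stalls.
    cpi = cpi_base + miss_rate_per_instr * stall_per_miss
    return cpi_base / cpi
```

In this sketch, `overlap_cycles` plays the role of the critical latency: for any `miss_latency` at or below it, efficiency stays at 1.0 (perfect tolerance), while above it efficiency degrades gracefully rather than falling off a cliff, consistent with the behavior the abstract describes.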