Prefetching into CPU caches has long been known to reduce the cache miss ratio, yet known implementations of prefetching have failed to improve CPU performance. The reason is that prefetches interfere with normal cache operation: they keep the cache address and data ports busy, the memory bus busy, and the memory banks busy, and they are not necessarily complete by the time the prefetched data is actually referenced. In this paper, we present extensive quantitative results from a detailed, cycle-by-cycle, trace-driven simulation of a uniprocessor memory system in which we vary most of the relevant parameters to determine when, if ever, hardware prefetching is useful. We find that for prefetching to actually improve performance, the address array must be double-ported and the data array must be either double-ported or fully buffered. It also helps greatly for the bus to be very wide (e.g., 16 bytes), for bus transactions to be split, and for main memory to be interleaved. Under the best circumstances, i.e., with a significant investment in extra hardware, prefetching can improve performance substantially. For implementations without adequate hardware, prefetching often decreases performance.
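The gap the abstract describes, between miss-ratio benefit and performance benefit, can be illustrated with a miss-ratio-only model. The sketch below (not the paper's simulator; block size, set count, and the synthetic trace are assumptions) counts misses for a direct-mapped cache with and without one-block-lookahead ("next-line") hardware prefetching. By construction it ignores timing effects such as port contention, bus occupancy, and prefetch latency, which are exactly the costs the paper shows can erase the miss-ratio gain.

```python
BLOCK = 16      # block size in bytes (assumed for illustration)
SETS = 64       # number of direct-mapped cache sets (assumed)

def simulate(trace, prefetch=False):
    """Return (misses, references) for a direct-mapped cache over a byte-address trace."""
    tags = [None] * SETS          # one tag per set
    misses = 0
    for addr in trace:
        block = addr // BLOCK
        idx, tag = block % SETS, block // SETS
        if tags[idx] != tag:      # demand miss: fetch the block
            misses += 1
            tags[idx] = tag
        if prefetch:              # one-block-lookahead: also fetch the next block
            nb = block + 1
            nidx, ntag = nb % SETS, nb // SETS
            if tags[nidx] != ntag:
                tags[nidx] = ntag # prefetch fill; no demand miss is charged
    return misses, len(trace)

if __name__ == "__main__":
    # Sequential word-by-word sweep: next-line prefetching hides almost every miss.
    trace = list(range(0, 4096, 4))
    m0, n = simulate(trace)
    m1, _ = simulate(trace, prefetch=True)
    print(f"{n} refs: {m0} misses without prefetch, {m1} with")
```

In this zero-cost model, prefetching looks like a clear win on a sequential trace; the paper's point is that once each prefetch also occupies the cache ports, bus, and memory banks, the same mechanism can slow the processor down unless the hardware (dual ports, buffering, a wide split-transaction bus, interleaved memory) absorbs that traffic.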