Prefetching into CPU caches has long been known to reduce the cache miss ratio, yet known implementations of prefetching have failed to improve CPU performance. The reason is that prefetches interfere with normal cache operation: they keep the cache address and data ports busy, the memory bus busy, and the memory banks busy, and they are not necessarily complete by the time the prefetched data is actually referenced. In this paper, we present extensive quantitative results from a detailed, cycle-by-cycle, trace-driven simulation of a uniprocessor memory system in which we vary most of the relevant parameters to determine when, if ever, hardware prefetching is useful. We find that for prefetching to actually improve performance, the address array must be double-ported and the data array must be either double-ported or fully buffered. It also helps greatly for the bus to be very wide (e.g., 16 bytes), for bus transactions to be split, and for main memory to be interleaved. Under the best circumstances, i.e., with a significant investment in extra hardware, prefetching can improve performance substantially. For implementations without adequate hardware, prefetching often decreases performance.
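The gap the abstract describes, between miss-ratio benefit and performance benefit, can be illustrated with a miss-ratio-only model. The sketch below (not the paper's simulator; block size, set count, and the synthetic trace are assumptions) counts misses for a direct-mapped cache with and without one-block-lookahead ("next-line") hardware prefetching. By construction it ignores timing effects such as port contention, bus occupancy, and prefetch latency, which are exactly the costs the paper shows can erase the miss-ratio gain.

```python
BLOCK = 16      # block size in bytes (assumed for illustration)
SETS = 64       # number of direct-mapped cache sets (assumed)

def simulate(trace, prefetch=False):
    """Return (misses, references) for a direct-mapped cache over a byte-address trace."""
    tags = [None] * SETS          # one tag per set
    misses = 0
    for addr in trace:
        block = addr // BLOCK
        idx, tag = block % SETS, block // SETS
        if tags[idx] != tag:      # demand miss: fetch the block
            misses += 1
            tags[idx] = tag
        if prefetch:              # one-block-lookahead: also fetch the next block
            nb = block + 1
            nidx, ntag = nb % SETS, nb // SETS
            if tags[nidx] != ntag:
                tags[nidx] = ntag # prefetch fill; no demand miss is charged
    return misses, len(trace)

if __name__ == "__main__":
    # Sequential word-by-word sweep: next-line prefetching hides almost every miss.
    trace = list(range(0, 4096, 4))
    m0, n = simulate(trace)
    m1, _ = simulate(trace, prefetch=True)
    print(f"{n} refs: {m0} misses without prefetch, {m1} with")
```

In this zero-cost model, prefetching looks like a clear win on a sequential trace; the paper's point is that once each prefetch also occupies the cache ports, bus, and memory banks, the same mechanism can slow the processor down unless the hardware (dual ports, buffering, a wide split-transaction bus, interleaved memory) absorbs that traffic.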