ATUM: a new technique for capturing address traces using microcode
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Line (block) size choice for CPU cache memories
IEEE Transactions on Computers
Analysis of cache performance for operating systems and multiprogramming
Performance tradeoffs in cache design
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Computer architecture: a quantitative approach
Cache and memory hierarchy design: a performance-directed approach
Sequentiality and prefetching in database systems
ACM Transactions on Database Systems (TODS)
Characterizing the Storage Process and Its Effect on the Update of Main Memory by Write Through
Journal of the ACM (JACM)
Cache evaluation and the impact of workload choice
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
ACM Computing Surveys (CSUR)
Bibliography and reading on CPU cache memories and related topics
ACM SIGARCH Computer Architecture News
The effect of instruction fetch strategies upon the performance of pipelined instruction units
ISCA '77 Proceedings of the 4th annual symposium on Computer architecture
Sequential prefetch strategies for instructions and data
The Memory Architecture and the Cache and Memory Management Unit for
Data prefetching in multiprocessor vector cache memories
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
A study of I/O system organizations
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Stride directed prefetching in scalar processors
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Evaluating performance of prefetching second level caches
ACM SIGMETRICS Performance Evaluation Review
A unified architectural tradeoff methodology
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Instruction fetching: coping with code bloat
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Hardware implementation issues of data prefetching
ICS '95 Proceedings of the 9th international conference on Supercomputing
Architecture Technique Trade-Offs Using Mean Memory Delay Time
IEEE Transactions on Computers
Trap-driven memory simulation with Tapeworm II
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Run-time spatial locality detection and optimization
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Exploiting spatial locality in data caches using spatial footprints
Proceedings of the 25th annual international symposium on Computer architecture
Functional Implementation Techniques for CPU Cache Memories
IEEE Transactions on Computers - Special issue on cache memory and related problems
The pool of subsectors cache design
ICS '99 Proceedings of the 13th international conference on Supercomputing
ACM Computing Surveys (CSUR)
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Designing a Modern Memory Hierarchy with Hardware Prefetching
IEEE Transactions on Computers
Effective Hardware-Based Data Prefetching for High-Performance Processors
IEEE Transactions on Computers
The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared Memory Multiprocessor
IEEE Transactions on Parallel and Distributed Systems
Minerva: An Adaptive Subblock Coherence Protocol for Improved SMP Performance
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
A new NAND-type flash memory package with smart buffer system for spatial and temporal localities
Journal of Systems Architecture: the EUROMICRO Journal
Reducing Cache Pollution via Dynamic Data Prefetch Filtering
IEEE Transactions on Computers
Can High Bandwidth and Latency Justify Large Cache Blocks in Scalable Multiprocessors?
ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
Increasing cache capacity through word filtering
Proceedings of the 21st annual international conference on Supercomputing
Next high performance and low power flash memory package structure
Journal of Computer Science and Technology
Application-Specific hardware-driven prefetching to improve data cache performance
ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
This paper explores the interactions between a cache's block size, fetch size, and fetch policy from the perspective of maximizing system-level performance. It has been previously noted that, given a simple fetch strategy, the performance-optimal block size is almost always four or eight words [10]. If there is even a small cycle-time penalty associated with either longer blocks or longer fetches, then the performance-optimal size is noticeably reduced. In split cache organizations, where the fetch and block sizes of the instruction and data caches are all independent design variables, the instruction cache's block size and fetch size should be the same. For the workload and write-back write policy used in this trace-driven simulation study, the instruction cache block size should be about a factor of two greater than the data cache fetch size, which in turn should be equal to or double the data cache block size. The simplest fetch strategy, fetching only on a miss and stalling the CPU until the fetch is complete, works well. More complicated fetch strategies do not produce the performance improvements suggested by their reductions in miss ratio, because of limited memory resources and a strong temporal clustering of cache misses. For the environments simulated here, the most effective fetch strategy improved performance by between 1.7% and 4.5% over this simplest strategy.
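The block-size tradeoff the abstract describes can be illustrated with a toy trace-driven sketch (hypothetical code, not the paper's simulator): a direct-mapped cache that fetches exactly one block on a miss and stalls, combined with a simple mean-memory-delay model in which the miss penalty grows with block size. All parameters here (cache size, miss latency, transfer width, and the synthetic trace) are illustrative assumptions.

```python
def miss_ratio(trace, cache_bytes=4096, block_bytes=16):
    """Miss ratio of a direct-mapped cache that fetches one block per miss."""
    n_blocks = cache_bytes // block_bytes
    tags = [None] * n_blocks            # one tag per frame (direct-mapped)
    misses = 0
    for addr in trace:
        block = addr // block_bytes     # block-aligned address
        index = block % n_blocks
        if tags[index] != block:        # miss: fetch the block, stall the CPU
            tags[index] = block
            misses += 1
    return misses / len(trace)

def mean_access_time(mr, block_bytes, latency=6, bytes_per_cycle=4):
    """Cycles per access: 1-cycle hit plus a miss penalty that grows with block size."""
    penalty = latency + block_bytes // bytes_per_cycle
    return 1 + mr * penalty

# Short two-word sequential runs at widely spaced addresses: once the block
# covers a whole run, larger blocks no longer cut the miss ratio, while the
# transfer part of the miss penalty keeps growing -- so a small block wins.
trace = [a for base in range(0, 512 * 1024, 1024) for a in (base, base + 4)]
for bs in (4, 8, 16, 32):
    mr = miss_ratio(trace, block_bytes=bs)
    print(f"block={bs:2d}B  miss ratio={mr:.2f}  cycles/access={mean_access_time(mr, bs):.2f}")
```

On this synthetic trace the mean delay is minimized at an 8-byte (two-word) block, echoing the abstract's point that even modest per-fetch penalties pull the performance-optimal block size down to a few words; real traces with more spatial locality would shift the optimum.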