Logic verification algorithms and their parallel implementation
DAC '87 Proceedings of the 24th ACM/IEEE Design Automation Conference
Techniques for efficient inline tracing on a shared-memory multiprocessor
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Lockup-free caches in high-performance multiprocessors
Journal of Parallel and Distributed Computing
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
High-bandwidth data memory systems for superscalar processors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
An architecture for software-controlled data prefetching
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Simplicity Versus Accuracy in a Model of Cache Coherency Overhead
IEEE Transactions on Computers
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
MC88100 Microprocessors User's Manual
The DASH Prototype: Logic Overhead and Performance
IEEE Transactions on Parallel and Distributed Systems
Computing Per-Process Summary Side-Effect Information
Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
Lockup-free instruction fetch/prefetch cache organization
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
A low-overhead coherence solution for multiprocessors with private cache memories
ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
SPLASH: Stanford parallel applications for shared-memory
Simulation Analysis of Data Sharing in Shared Memory Multiprocessors
A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
A unified architectural tradeoff methodology
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Effective cache prefetching on bus-based multiprocessors
ACM Transactions on Computer Systems (TOCS)
IEEE Transactions on Parallel and Distributed Systems
A study of integrated prefetching and caching strategies
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
An effective programmable prefetch engine for on-chip caches
Proceedings of the 28th annual international symposium on Microarchitecture
Evaluation of multithreaded uniprocessors for commercial application environments
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
ACM Transactions on Computer Systems (TOCS)
Locality As a Visualization Tool
IEEE Transactions on Computers
Tolerating latency in multiprocessors through compiler-inserted prefetching
ACM Transactions on Computer Systems (TOCS)
Evaluation of high performance multicache parallel texture mapping
ICS '98 Proceedings of the 12th international conference on Supercomputing
CPU Cache Prefetching: Timing Evaluation of Hardware Implementations
IEEE Transactions on Computers
An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors
International Journal of Parallel Programming
Empirical investigation of the Markov reference model
Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
Hardware spatial forwarding for widely shared data
Proceedings of the 14th international conference on Supercomputing
The Journal of Supercomputing
IEEE Transactions on Parallel and Distributed Systems
Avoiding initialization misses to the heap
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Handling long-latency loads in a simultaneous multithreading processor
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
CQoS: a framework for enabling QoS in shared caches of CMP platforms
Proceedings of the 18th annual international conference on Supercomputing
Managing Wire Delay in Large Chip-Multiprocessor Caches
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Analyzing memory access intensity in parallel programs on multicore
Proceedings of the 22nd annual international conference on Supercomputing
Coordinated control of multiple prefetchers in multi-core systems
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Resolving a L2-prefetch-caused parallel nonscaling on Intel Core microarchitecture
Journal of Parallel and Distributed Computing
Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a multiprocessor, where it can degrade bus utilization, overall cache miss rates, memory latencies, and data sharing. We simulated the effects of a particular compiler-directed prefetching algorithm running on a bus-based multiprocessor and showed that, despite its high memory latency, this architecture is not well suited to prefetching: across several variations on the architecture, adding prefetching to a workload of five parallel programs produced speedups of no more than 39% and degradations of up to 7%. We examined the sources of cache misses under several different prefetching strategies and pinpointed the causes of the performance changes. Invalidation misses pose a particular problem for current compiler-directed prefetchers. We applied two techniques that reduced their impact: a prefetching heuristic tailored to write-shared data, and restructuring of shared data to reduce false sharing, which allows traditional prefetching algorithms to work well.