Multiprocessor cache design considerations

Authors:
R. L. Lee; P. C. Yew; D. H. Lawrie
Affiliations:
Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, IL (all authors)
Venue:
ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Year:
1987

Abstract

In this paper, cache design is explored for large high-performance multiprocessors with hundreds or thousands of processors and memory modules interconnected by a pipelined multi-stage network. The majority of the multiprocessor cache studies in the literature focus exclusively on the issue of cache coherence enforcement. However, there are other characteristics unique to such multiprocessors which create an environment for cache performance that is very different from that of many uniprocessors.

Multiprocessor conditions are identified and modeled, including: 1) the cost of a cache coherence enforcement scheme, 2) the effect of a high degree of overlap between cache miss services, 3) the cost of a pin-limited data path between shared memory and caches, 4) the effect of a high degree of data prefetching, 5) the program behavior of a scientific workload as represented by 23 numerical subroutines, and 6) the parallel execution of programs. This model is used to show that the cache miss ratio is not a suitable performance measure in the multiprocessors of interest and to show that the optimal cache block size in such multiprocessors is much smaller than in many uniprocessors.
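
The abstract's two conclusions can be illustrated with a toy calculation. The sketch below is not the paper's model: the power-law miss ratio, the latency, the overlap fraction, and the path width are all hypothetical values chosen only to show the shape of the argument. It assumes the stall cost of a miss is the latency left exposed after overlapping miss services plus the time to move the block over a narrow data path.

```python
# Toy illustration only; every constant here is a hypothetical assumption,
# not a value or formula taken from the paper.

BLOCK_SIZES = [1, 2, 4, 8, 16, 32]  # candidate block sizes, in words


def miss_ratio(block_words, base_ratio=0.10, alpha=0.5):
    """Assumed power-law miss ratio that falls as blocks grow."""
    return base_ratio * block_words ** -alpha


def stall_per_reference(block_words, latency=20.0, overlap=0.9,
                        path_words_per_cycle=1.0):
    """Average stall cycles per memory reference under the toy assumptions.

    latency: network plus memory round trip in cycles (assumed).
    overlap: fraction of that latency hidden by overlapped miss services.
    path_words_per_cycle: width of the pin-limited data path (assumed).
    """
    exposed_latency = latency * (1.0 - overlap)
    transfer_time = block_words / path_words_per_cycle
    return miss_ratio(block_words) * (exposed_latency + transfer_time)


def best_block(metric):
    """Block size that minimizes the given metric."""
    return min(BLOCK_SIZES, key=metric)


if __name__ == "__main__":
    print("by miss ratio:        ", best_block(miss_ratio))
    print("stall, no overlap:    ",
          best_block(lambda b: stall_per_reference(b, overlap=0.0)))
    print("stall, 90% overlap:   ", best_block(stall_per_reference))
```

With these assumed numbers, miss ratio alone favors the largest block (32 words), stall time without overlap favors 16 words, and stall time with 90% overlap favors 2 words. When most of the latency is hidden, the block transfer time over the narrow path dominates the cost of a miss, so a lower miss ratio no longer implies better performance.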