Adjustable block size coherent caches

Authors:
Czarek Dubnicki;Thomas J. LeBlanc
Affiliations:
-;-
Venue:
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Year:
1992

Abstract

Several studies have shown that the performance of coherent caches depends on the relationship between the granularity of sharing and locality exhibited by the program and the cache block size. Large cache blocks exploit processor and spatial locality, but may cause unnecessary cache invalidations due to false sharing. Small cache blocks can reduce the number of cache invalidations, but increase the number of bus or network transactions required to load data into the cache. In this paper, we describe a cache organization that dynamically adjusts the cache block size according to recently observed reference behavior. Cache blocks are split across cache lines when false sharing occurs, and merged back into a single cache line to exploit spatial locality. To evaluate this cache organization, we simulate a scalable multiprocessor with coherent caches, using a suite of memory reference traces to model program behavior. We show that for every fixed block size, some program suffers a 33% increase in the average waiting time per reference, and a factor of 2 increase in the average number of words transferred per reference, when compared against the performance of an adjustable block size cache. In the few cases where adjusting the block size does not provide superior performance, it comes within 7% of the best fixed block size alternative. We conclude that an adjustable block size cache offers significantly better performance than every fixed block size cache, especially when there is variability in the granularity of sharing exhibited by applications.
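
The block size tradeoff described in the abstract can be illustrated with a small sketch. This is not the authors' simulator or their adjustable-block protocol. It is a toy model under stated assumptions: infinite caches, a simple write-invalidate policy, and fixed block sizes. The function name `simulate` and both traces are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def simulate(trace, block_words):
    """Count misses and words transferred for one fixed block size.

    trace: iterable of (processor_id, word_address, is_write) tuples.
    block_words: number of words per cache block.
    Assumptions (not from the paper): infinite caches, write-invalidate.
    """
    holders = defaultdict(set)  # block number -> processors with a valid copy
    misses = 0
    words_transferred = 0
    for proc, addr, is_write in trace:
        block = addr // block_words
        if proc not in holders[block]:
            # Miss: the whole block is fetched.
            misses += 1
            words_transferred += block_words
            holders[block].add(proc)
        if is_write:
            # A write invalidates every other copy of the block.
            holders[block] = {proc}
    return misses, words_transferred


# Processor 0 writes word 0 and processor 1 writes word 1, alternating.
# The processors never touch the same word, so any sharing is false sharing.
false_sharing_trace = [(p, p, True) for _ in range(4) for p in (0, 1)]

# One processor reads 8 consecutive words: pure spatial locality.
sequential_trace = [(0, a, False) for a in range(8)]

for size in (1, 4):
    print(size, simulate(false_sharing_trace, size), simulate(sequential_trace, size))
```

In this toy model, one-word blocks give 2 misses on the false-sharing trace but 8 misses on the sequential trace, while four-word blocks give 8 misses (32 words transferred) on the false-sharing trace and only 2 misses on the sequential one. No single fixed size does well on both traces, which is the situation an adjustable block size is meant to address.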