PRESTO: a system for object-oriented parallel programming
Software—Practice & Experience
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The VMP multiprocessor: initial experience, refinements, and performance evaluation
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The effect of sharing on the cache and bus performance of parallel programs
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
NUMA policies and their relation to memory architecture
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Comparative evaluation of latency reducing and tolerating techniques
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An empirical evaluation of two memory-efficient directory methods
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The directory-based cache coherence protocol for the DASH multiprocessor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
SPLASH: Stanford parallel applications for shared-memory
SPLASH: Stanford parallel applications for shared-memory
A performance evaluation of optimal hybrid cache coherency protocols
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
The detection and elimination of useless misses in multiprocessors
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Performance evaluation of hybrid hardware and software distributed shared memory protocols
ICS '94 Proceedings of the 8th international conference on Supercomputing
A comprehensive bibliography of distributed shared memory
ACM SIGOPS Operating Systems Review
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
CAT—caching address tags: a technique for reducing area cost of on-chip caches
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
A cost-comparison approach for adaptive distributed shared memory
ICS '96 Proceedings of the 10th international conference on Supercomputing
Tradeoffs between false sharing and aggregation in software distributed shared memory
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Minimizing Area Cost of On-Chip Cache Memories by Caching Address Tags
IEEE Transactions on Computers
A Performance Study on Bounteous Transfer in Multiprocessor Sectored Caches
The Journal of Supercomputing - Special issue: high performance computing systems
Eliminating synchronization overhead in automatically parallelized programs using dynamic feedback
ACM Transactions on Computer Systems (TOCS)
IEEE Transactions on Parallel and Distributed Systems
A compiler-directed cache coherence scheme with improved intertask locality
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Minerva: An Adaptive Subblock Coherence Protocol for Improved SMP Performance
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Two techniques for improving performance on bus-based multiprocessors
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Analysis of Shared Memory Misses and Reference Patterns
ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
VL-CDRAM: variable line sized cached DRAMs
Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Proceedings of the 32nd annual international symposium on Computer Architecture
Proceedings of the 33rd annual international symposium on Computer Architecture
Unbounded page-based transactional memory
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Increasing cache capacity through word filtering
Proceedings of the 21st annual international conference on Supercomputing
False sharing and its effect on shared memory performance
Sedms'93 USENIX Systems on USENIX Experiences with Distributed and Multiprocessor Systems - Volume 4
Adaptive line size cache for irregular references on cell multicore processor
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
DeFT: Design space exploration for on-the-fly detection of coherence misses
ACM Transactions on Architecture and Code Optimization (TACO)
Edge chasing delayed consistency: pushing the limits of weak memory models
Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability
Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Protozoa: adaptive granularity cache coherence
Proceedings of the 40th Annual International Symposium on Computer Architecture
Hi-index | 0.00 |
Several studies have shown that the performance of coherent caches depends on the relationship between the granularity of sharing and locality exhibited by the program and the cache block size. Large cache blocks exploit processor and spatial locality, but may cause unnecessary cache invalidations due to false sharing. Small cache blocks can reduce the number of cache invalidations, but increase the nuber of bus or network transactions required to load data into the cache. In this paper we describe a cache organization that dynamically adjusts the cache block size according to recently observed reference behavior. Cache blocks are split across cache lines when false sharing occurs, ad merged back into a single cache line to explit spatial locality. To evaluate this cache organization, we simulate a scalable multiprocessor with coherent caches, using a suite of memory reference traces to model program behavior. We show that for evry fixed block size, some program suffers a 33% increase in the average waiting time per reference, and a factor of 2 increase in the average number of words transferred per reference, when compared against the performance of an adjustable block size cache. In the few cases where adjusting the block size does not provide superior performance, it comes within 7% of the best fixed block size alternative. We conclude that an adjustable block size cache offers significantly better performance than every fixed block size cache, especially when there is variability in the granularity of sharing exhibited by applications.