Shared-bus computer systems permit relatively simple and efficient implementation of cache consistency algorithms, but the shared bus is a bottleneck that limits performance. False sharing can be an important source of unnecessary traffic for invalidation-based protocols, and eliminating it can provide significant performance improvements. For many multiprocessor workloads, however, most misses are true sharing misses and cold start misses. Regardless of the cause of cache misses, the largest fraction of bus traffic consists of words transferred between caches without being accessed, which we refer to as dead sharing.

We establish here new methods for characterizing cache block reference patterns, and we measure how these patterns change with variation in workload and block size. Our results show that 42 percent of 64-byte cache blocks are invalidated before more than one word has been read from the block, and that 58 percent of modified blocks have only a single word modified before the block is invalidated. Approximately 50 percent of blocks written and subsequently read by other caches show no use of the newly written information before the block is again invalidated.

In addition to our general analysis of reference patterns, we also present a detailed analysis of dead sharing for each shared memory multiprocessor program studied. We find that the worst 10 blocks (those with the most total misses) from each of our traces contribute, on average, almost 50 percent of the false sharing misses and almost 20 percent of the true sharing misses. A relatively simple restructuring of four of our workloads based on analysis of these 10 worst blocks leads to a 21 percent reduction in overall misses and a 15 percent reduction in execution time. Permitting the block size to vary (as could be accomplished with a sector cache) shows that bus traffic can be reduced by 88 percent (for 64-byte blocks) while also decreasing the miss ratio by 35 percent.
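
To make the notion of dead sharing concrete, the following is a minimal sketch of how the fraction of transferred-but-unused words might be counted from a reference trace. It is not the measurement method used in the study: it assumes a simple write-invalidate protocol, infinite caches (so blocks leave a cache only through invalidation), 4-byte words, and a hypothetical trace format of `(cpu, op, byte_address)` tuples. The function name and parameters are illustrative.

```python
WORD_BYTES = 4  # assumed word size; not specified in the abstract


def dead_sharing_fraction(trace, block_bytes=64):
    """Fraction of words moved over the bus that are never accessed
    before their block copy is invalidated (or the trace ends).

    Toy model: write-invalidate coherence, infinite caches, write-allocate.
    `trace` is an iterable of (cpu, op, byte_address), op being "R" or "W".
    """
    words_per_block = block_bytes // WORD_BYTES
    resident = {}    # (cpu, block) -> set of word offsets touched since fill
    transferred = 0  # words brought into caches by misses
    used = 0         # of those, words actually touched before retirement

    def retire(key):
        # A copy leaves the cache: credit the words it actually used.
        nonlocal used
        used += len(resident.pop(key))

    for cpu, op, addr in trace:
        block, word = divmod(addr // WORD_BYTES, words_per_block)
        key = (cpu, block)
        if key not in resident:
            # Miss: the whole block is transferred.
            resident[key] = set()
            transferred += words_per_block
        resident[key].add(word)
        if op == "W":
            # Invalidate every other cache's copy of this block.
            others = [k for k in resident if k[1] == block and k[0] != cpu]
            for other in others:
                retire(other)

    # Account for copies still resident when the trace ends.
    for key in list(resident):
        retire(key)

    return 1 - used / transferred if transferred else 0.0


# Two processors alternately writing adjacent words of the same 64-byte block.
ping_pong = [(0, "W", 0), (1, "W", 4), (0, "W", 0), (1, "W", 4)]
print(dead_sharing_fraction(ping_pong, block_bytes=64))
print(dead_sharing_fraction(ping_pong, block_bytes=4))
```

In this toy trace each 64-byte fill moves 16 words of which only one is touched, whereas with word-sized blocks the two processors no longer share a block at all. This is the kind of effect a variable block size is meant to exploit, though the percentages reported above come from the authors' own traces and methodology, not from a model like this one.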