Memory Access Dependencies in Shared-Memory Multiprocessors
IEEE Transactions on Software Engineering
The Stanford Dash Multiprocessor
Computer
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Adaptive cache coherency for detecting migratory shared data
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
An adaptive cache coherence protocol optimized for migratory sharing
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Journal of Parallel and Distributed Computing
Future Generation Computer Systems
Essential misses and data traffic in coherence protocols
Journal of Parallel and Distributed Computing - Special issue on distributed shared memory systems
Sequential Hardware Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Two Adaptive Hybrid Cache Coherency Protocols
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
IEEE Transactions on Parallel and Distributed Systems
A holistic approach to computer system design education based on system simulation techniques
WCAE '98 Proceedings of the 1998 workshop on Computer architecture education
EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
Hi-index | 4.10 |
Shared memory multiprocessors make it practical to convert sequential programs to parallel ones in a variety of applications. An emerging class of shared memory multiprocessors are nonuniform memory access machines with private caches and a cache coherence protocol. Proposed hardware optimizations to CC-NUMA machines can shorten the time processors lose because of cache misses and invalidations. The authors look at cost-performance trade-offs for each of four proposed optimizations: release consistency, adaptive sequential prefetching, migratory sharing detection, and hybrid update/invalidate with a write cache. The four optimizations differ with respect to which application features they attack, what hardware resources they require, and what constraints they impose on the application software. The authors measured the degree of performance improvement using the four optimizations in isolation and in combination, looking at the trade-offs in hardware and programming complexities. Although one combination of the proposed optimizations (prefetching and migratory sharing detection) can boost a sequentially consistent machine to perform as well as a machine with release consistency, release consistency models offer significant performance improvements across a broad application domain at little extra complexity in the machine design. Moreover, a combination of sequential prefetching and hybrid update/invalidate with a write cache cuts the execution time of a sequentially consistent machine by half with fairly modest changes to the second-level cache and the cache protocol. The authors expect that designers will begin to turn more to the release consistency model.