SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Using prediction to accelerate coherence protocols
Proceedings of the 25th annual international symposium on Computer architecture
Memory sharing predictor: the key to a speculative coherent DSM
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Multicast snooping: a new coherence method using a multicast address network
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
An empirical evaluation of two memory-efficient directory methods
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Selective, accurate, and timely self-invalidation using last-touch prediction
Proceedings of the 27th annual international symposium on Computer architecture
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
Starfire: Extending the SMP Envelope
IEEE Micro
Design and Performance of Directory Caches for Scalable Shared Memory Multiprocessors
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Switch Cache: A Framework for Improving the Remote Memory Access Latency of CC-NUMA Multiprocessors
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Using Switch Directories to Speed Up Cache-to-Cache Transfers in CC-NUMA Multiprocessors
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
A New Scalable Directory Architecture for Large-Scale Multiprocessors
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Hi-index | 0.00 |
Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller and the network interface. In this work we exploit such integration scale, presenting a new three-level directory architecture aimed at reducing the long L2 miss latencies and the memory overhead that characterize cc-NUMA machines and limit their scalability. The proposed architecture is based on the integration into the processor chip of the directory controller and a small first-level directory cache that stores precise information for the most recently referenced memory lines, as the means to reduce miss latencies. The second- and third-level directories are located near main memory and they are only accessed when a directory entry for a certain memory line is not present in the first-level directory. This off-chip structure achieves the performance of a big and nonscalable full-map directory with a very significant reduction in the memory overhead. Using execution-driven simulations, we show that substantial latency reductions can be obtained by using the proposed directory architecture. Load, store and read-modify-write misses are significantly accelerated (latency reductions of more than 35% in some cases). These reductions translate into important improvements on the final application performance (reductions up to 20% in execution time).