Experimentally Characterizing the Behavior of Multiprocessor Memory Systems: A Case Study
IEEE Transactions on Software Engineering
Micro benchmark analysis of the KSR1
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
The GLOW cache coherence protocol extensions for widely shared data
ICS '96 Proceedings of the 10th international conference on Supercomputing
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Characterizing Distributed Shared Memory Performance: A Case Study of the Convex SPP1000
IEEE Transactions on Parallel and Distributed Systems
Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
Scalable Shared-Memory Multiprocessing
Scalable Shared-Memory Multiprocessing
Advanced performance features of the 64-bit PA-8000
COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
A highly scalable system utilizing up to 128 PA-RISC processors
COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
The evolution of the HP/Convex Exemplar
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
lmbench: portable tools for performance analysis
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
Communications of the ACM
Scaling application performance on a cache-coherent multiprocessor
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
ICS '99 Proceedings of the 13th international conference on Supercomputing
IEEE Transactions on Parallel and Distributed Systems
SMTp: An Architecture for Next-generation Scalable Multi-threading
Proceedings of the 31st annual international symposium on Computer architecture
International Journal of Parallel Programming
Hi-index | 0.02 |
Advances in microarchitecture, packaging, and manufacturing processes enable designers to build new systems with higher performance and scalability. Using microbenchmark techniques, we contrast the memory and communication performance of two generations of the HP/Convex Exemplar scalable parallel processing system. The SPP1000 and SPP2000 have significant architectural and implementation differences, but maintain upward binary compatibility. The SPP2000 employs manufacturing and packaging advances to obtain shorter system interconnects with wider data paths and improved functionality, thereby reducing the latency and increasing the bandwidth of remote communication. Although the memory latency is not significantly improved, newer out-of-order execution processors coupled with nonblocking caches achieve much higher memory bandwidth. The SPP2000 has a richer system interconnect topology that allows scalability to a larger number of processors. The SPP2000 also employs innovations in its coherence protocols to improve synchronization and communication performance. This paper characterizes the performance effects of these changes, and identifies some remaining inefficiencies, in the cache coherence protocol and the node configuration, that future systems should address.