SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The MIT Alewife machine: architecture and performance
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Operating system support for improving data locality on CC-NUMA compute servers
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
The Mercury Interconnect Architecture: a cost-effective infrastructure for high-performance servers
Proceedings of the 24th annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
System support for automatic profiling and optimization
Proceedings of the sixteenth ACM symposium on Operating systems principles
In search of clusters (2nd ed.)
In search of clusters (2nd ed.)
Design and implementation of the NUMAchine multiprocessor
DAC '98 Proceedings of the 35th annual Design Automation Conference
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads
Proceedings of the 25th annual international symposium on Computer architecture
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Performance of database workloads on shared-memory systems with out-of-order processors
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
Inside Windows NT
Scalable Shared-Memory Multiprocessing
Scalable Shared-Memory Multiprocessing
lmbench: portable tools for performance analysis
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
WINSYM'99 Proceedings of the 3rd conference on USENIX Windows NT Symposium - Volume 3
Hi-index | 0.00 |
Commercial cache-coherent nonuniform memory access (ccNUMA) systems often require extensive investments in hardware design and operating system support. A different approach to building these systems is to use Standard High Volume (SHV) hardware and stock software components as building blocks and assemble them with minimal investments in hardware and software. This design approach trades the performance advantages of specialized hardware design for simplicity and implementation speed, and relies on application-level tuning for scalability and performance. We present our experience with this approach in this paper. We built a 16-way ccNUMA Intel system consisting of four commodity four-processor Fujitsu® Teamserver™ SMPs connected by a Synfinity™ cache-coherent switch. The system features a total of sixteen 350-MHz Intel® Xeon™ processors and 4 GB of physical memory, and runs the standard commercial Microsoft Windows NT® operating system. The system can be partitioned statically or dynamically, and uses an innovative, combined hardware/software approach to support application-level performance tuning. On the hardware side, a programmable performance-monitor card measures the frequency of remote-memory accesses, which constitute the predominant source of performance overhead. The monitor does not cause any performance overhead and can be deployed in production mode, providing the possibility for dynamic performance tuning if the application workload changes over time. On the software side, the Resource Set abstraction allows application-level threads to improve performance and scalability by specifying their execution and memory affinity across the ccNUMA system. Results from a performance-evaluation study confirm the success of the combined hardware/software approach for performance tuning in computation-intensive workloads. The results also show that the poor local-memory bandwidth in commodity Intel-based systems, rather than the latency of remote-memory access, is often the main contributor to poor scalability and performance. The contributions of this work can be summarized as follows: • The Resource Set abstraction allows control over resource allocation in a portable manner across ccNUMA architectures; we describe how it was implemented without modifying the operating system. • An innovative hardware design for a programmable performance-monitor card is designed specifically for a ccNUMA environment and allows dynamic, adaptive performance optimizations. • A performance study shows that performance and scalability are often limited by the local-memory bandwidth rather than by the effects of remote-memory access in an Intel-based architecture.