Experience with building a commodity Intel-based ccNUMA system

Authors:
B. C. Brock;G. D. Carpenter;E. Chiprout;M. E. Dean;P. L. De Backer;E. N. Elnozahy;H. Franke;M. E. Giampapa;D. Glasco;J. L. Peterson;R. Rajamony;R. Ravindran;F. L. Rawson;R. L. Rockhold;J. Rubio
Affiliations:
IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Research Division, Austin Research Laboratory, Austin, Texas;Intel Corporation, Chandler, Arizona;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;Newisys, Inc., Austin, Texas;IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Global Services India Pvt. Limited;IBM Research Division, Austin Research Laboratory, Austin, Texas;WhisperWire, Inc., Austin, Texas;The University of Texas at Austin, Department of Electrical and Computer Engineering, Austin, Texas
Venue:
IBM Journal of Research and Development
Year:
2001

Abstract

Commercial cache-coherent nonuniform memory access (ccNUMA) systems often require extensive investments in hardware design and operating system support. A different approach to building these systems is to use Standard High Volume (SHV) hardware and stock software components as building blocks and assemble them with minimal investments in hardware and software. This design approach trades the performance advantages of specialized hardware design for simplicity and implementation speed, and relies on application-level tuning for scalability and performance. We present our experience with this approach in this paper. We built a 16-way ccNUMA Intel system consisting of four commodity four-processor Fujitsu® Teamserver™ SMPs connected by a Synfinity™ cache-coherent switch. The system features a total of sixteen 350-MHz Intel® Xeon™ processors and 4 GB of physical memory, and runs the standard commercial Microsoft Windows NT® operating system. The system can be partitioned statically or dynamically, and uses an innovative, combined hardware/software approach to support application-level performance tuning. On the hardware side, a programmable performance-monitor card measures the frequency of remote-memory accesses, which constitute the predominant source of performance overhead. The monitor does not cause any performance overhead and can be deployed in production mode, providing the possibility for dynamic performance tuning if the application workload changes over time. On the software side, the Resource Set abstraction allows application-level threads to improve performance and scalability by specifying their execution and memory affinity across the ccNUMA system. Results from a performance-evaluation study confirm the success of the combined hardware/software approach for performance tuning in computation-intensive workloads. 
The results also show that poor local-memory bandwidth in commodity Intel-based systems, rather than the latency of remote-memory access, is often the main contributor to poor scalability and performance. The contributions of this work can be summarized as follows:
• The Resource Set abstraction allows control over resource allocation in a portable manner across ccNUMA architectures; we describe how it was implemented without modifying the operating system.
• An innovative programmable performance-monitor card, designed specifically for a ccNUMA environment, allows dynamic, adaptive performance optimizations.
• A performance study shows that, in an Intel-based architecture, performance and scalability are often limited by local-memory bandwidth rather than by the effects of remote-memory access.
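To make the two tuning ideas in the abstract concrete, the following is a minimal illustrative sketch, not the paper's interface. The abstract does not give the Resource Set API or the monitor's output format, so every name here (`ResourceSet`, `for_node`, `is_local`, `node_of_cpu`, `remote_fraction`) is hypothetical, and the node-by-node CPU numbering is an assumption. Only the topology (four nodes of four processors) comes from the text.

```python
from dataclasses import dataclass

# Topology taken from the abstract: four 4-processor SMP nodes (16 CPUs total).
NODES = 4
CPUS_PER_NODE = 4


def node_of_cpu(cpu: int) -> int:
    """Return the node owning a CPU, assuming CPUs are numbered node by node."""
    if not 0 <= cpu < NODES * CPUS_PER_NODE:
        raise ValueError(f"cpu {cpu} out of range")
    return cpu // CPUS_PER_NODE


@dataclass(frozen=True)
class ResourceSet:
    """Hypothetical stand-in for a Resource Set: the CPUs a thread may run on
    and the nodes it may allocate memory from."""
    cpus: frozenset
    memory_nodes: frozenset

    @classmethod
    def for_node(cls, node: int) -> "ResourceSet":
        """Build a set that confines execution and memory to a single node."""
        if not 0 <= node < NODES:
            raise ValueError(f"node {node} out of range")
        first = node * CPUS_PER_NODE
        return cls(frozenset(range(first, first + CPUS_PER_NODE)),
                   frozenset({node}))

    def is_local(self) -> bool:
        """True when every permitted CPU sits on a node the set may allocate from."""
        return all(node_of_cpu(c) in self.memory_nodes for c in self.cpus)


def remote_fraction(cpu: int, accesses_by_node: dict) -> float:
    """Fraction of memory accesses issued from `cpu` that target another node.

    `accesses_by_node` maps a node number to an access count; this stands in
    for the kind of figure a remote-access monitor could report.
    """
    total = sum(accesses_by_node.values())
    if total == 0:
        return 0.0
    local = accesses_by_node.get(node_of_cpu(cpu), 0)
    return (total - local) / total
```

In this sketch, `ResourceSet.for_node(2)` keeps a thread on CPUs 8 to 11 with memory drawn only from node 2, and `remote_fraction` gives the kind of ratio that application-level tuning would try to drive down. A real implementation would also have to bind threads and allocations through operating-system calls, which this model deliberately leaves out.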