Experience with building a commodity Intel-based ccNUMA system

  • Authors:
  • B. C. Brock;G. D. Carpenter;E. Chiprout;M. E. Dean;P. L. De Backer;E. N. Elnozahy;H. Franke;M. E. Giampapa;D. Glasco;J. L. Peterson;R. Rajamony;R. Ravindran;F. L. Rawson;R. L. Rockhold;J. Rubio

  • Affiliations:
  • IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Research Division, Austin Research Laboratory, Austin, Texas;Intel Corporation, Chandler, Arizona;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;Newisys, Inc., Austin, Texas;IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Research Division, Austin Research Laboratory, Austin, Texas;IBM Global Services India Pvt. Limited;IBM Research Division, Austin Research Laboratory, Austin, Texas;WhisperWire, Inc., Austin, Texas;The University of Texas at Austin, Department of Electrical and Computer Engineering, Austin, Texas

  • Venue:
  • IBM Journal of Research and Development
  • Year:
  • 2001

Abstract

Commercial cache-coherent nonuniform memory access (ccNUMA) systems often require extensive investments in hardware design and operating system support. A different approach to building these systems is to use Standard High Volume (SHV) hardware and stock software components as building blocks and assemble them with minimal investments in hardware and software. This design approach trades the performance advantages of specialized hardware design for simplicity and implementation speed, and relies on application-level tuning for scalability and performance. We present our experience with this approach in this paper. We built a 16-way ccNUMA Intel system consisting of four commodity four-processor Fujitsu® Teamserver™ SMPs connected by a Synfinity™ cache-coherent switch. The system features a total of sixteen 350-MHz Intel® Xeon™ processors and 4 GB of physical memory, and runs the standard commercial Microsoft Windows NT® operating system. The system can be partitioned statically or dynamically, and uses an innovative, combined hardware/software approach to support application-level performance tuning. On the hardware side, a programmable performance-monitor card measures the frequency of remote-memory accesses, which constitute the predominant source of performance overhead. The monitor does not cause any performance overhead and can be deployed in production mode, providing the possibility for dynamic performance tuning if the application workload changes over time. On the software side, the Resource Set abstraction allows application-level threads to improve performance and scalability by specifying their execution and memory affinity across the ccNUMA system. Results from a performance-evaluation study confirm the success of the combined hardware/software approach for performance tuning in computation-intensive workloads. 
The results also show that poor local-memory bandwidth in commodity Intel-based systems, rather than the latency of remote-memory access, is often the main contributor to poor scalability and performance.

The contributions of this work can be summarized as follows:

  • The Resource Set abstraction allows control over resource allocation in a portable manner across ccNUMA architectures; we describe how it was implemented without modifying the operating system.
  • An innovative programmable performance-monitor card, designed specifically for a ccNUMA environment, allows dynamic, adaptive performance optimizations.
  • A performance study shows that, in an Intel-based architecture, performance and scalability are often limited by local-memory bandwidth rather than by the effects of remote-memory access.
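To make the two tuning ideas in the abstract concrete, the following is a minimal illustrative sketch, not the paper's implementation or API. It assumes the 4-node, 4-processors-per-node layout described above. The names `ResourceSet`, `node_of_cpu`, `spans_single_node`, and `remote_fraction` are hypothetical and chosen only for this example. The sketch models a resource set as a group of processors plus the memory nodes a thread may allocate from, and computes the fraction of accesses that cross nodes, which is the kind of quantity the performance-monitor card is described as measuring.

```python
from dataclasses import dataclass

# Assumed topology, taken from the abstract: four nodes of four processors each.
NODES = 4
CPUS_PER_NODE = 4


def node_of_cpu(cpu: int) -> int:
    """Map a processor index (0-15) to the node (0-3) that hosts it."""
    return cpu // CPUS_PER_NODE


@dataclass(frozen=True)
class ResourceSet:
    """Hypothetical stand-in for a resource set: the processors a thread
    may run on and the nodes its memory may be allocated from."""
    cpus: frozenset
    memory_nodes: frozenset

    def spans_single_node(self) -> bool:
        """True when all processors and all memory sit on one node,
        so no access made under this set can be remote."""
        nodes = {node_of_cpu(c) for c in self.cpus} | set(self.memory_nodes)
        return len(nodes) == 1


def remote_fraction(accesses) -> float:
    """Fraction of (cpu, memory_node) accesses that cross a node boundary."""
    accesses = list(accesses)
    if not accesses:
        return 0.0
    remote = sum(1 for cpu, mem in accesses if node_of_cpu(cpu) != mem)
    return remote / len(accesses)
```

For instance, a set holding processors 0 to 3 with memory on node 0 stays on one node, while a set holding processors 0 and 4 does not. An application could, in principle, use such a check together with measured remote-access frequency to decide how to place threads and data.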