Memory system performance of UNIX on CC-NUMA multiprocessors

  • Authors:
  • John Chapin; A. Herrod; Mendel Rosenblum; Anoop Gupta

  • Affiliations:
  • Computer Systems Laboratory, Stanford University, Stanford, CA (all authors)

  • Venue:
  • Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
  • Year:
  • 1995

Abstract

This study characterizes the performance of a variant of UNIX SVR4 on a large shared-memory multiprocessor and analyzes the effects of possible OS and architectural changes. We use a nonintrusive cache miss monitor to trace the execution of an OS-intensive multiprogrammed workload on the Stanford DASH, a 32-CPU CC-NUMA multiprocessor (CC-NUMA multiprocessors have cache-coherent shared memory that is physically distributed across the machine). We find that our version of UNIX accounts for 24% of the workload's total execution time. A surprisingly large fraction of OS time (79%) is spent on memory system stalls, divided equally between instruction and data cache miss time. In analyzing techniques to reduce instruction cache miss stall time, we find that replication of only 7% of the OS code would allow 80% of instruction cache misses to be serviced locally on a CC-NUMA machine. For data cache misses, we find that a small number of routines account for 96% of OS data cache stall time. We find that most of these misses are coherence (communication) misses, and larger caches will not necessarily help. After presenting detailed performance data, we analyze the benefits of several OS changes and predict the effects of altering the cache configuration, degree of clustering, and cache coherence mechanism of the machine. (This paper is available via http://wwwflash.stanford.edu.)
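The percentages in the abstract are stated against different bases (total execution time, OS time, OS data cache stall time), so it can help to put them on one scale. The following is a minimal sketch of that arithmetic, not anything taken from the paper itself. It assumes that "divided equally" means an exact 50/50 split between instruction and data stall time, and the function and constant names are hypothetical.

```python
# Figures quoted in the abstract, each as a fraction of its stated base.
OS_SHARE_OF_RUNTIME = 0.24          # OS time / total workload execution time
STALL_SHARE_OF_OS = 0.79            # memory system stall time / OS time
HOT_ROUTINE_DATA_STALL_SHARE = 0.96 # share of OS data stall time in a few routines


def stall_breakdown():
    """Express each component as a fraction of total workload execution time."""
    os_memory_stall = OS_SHARE_OF_RUNTIME * STALL_SHARE_OF_OS
    # Assumption: "divided equally" is treated as an exact 50/50 split.
    os_instr_stall = os_memory_stall / 2
    os_data_stall = os_memory_stall / 2
    return {
        "os_memory_stall": os_memory_stall,
        "os_instr_stall": os_instr_stall,
        "os_data_stall": os_data_stall,
        # Data stall time attributed to the small set of routines.
        "hot_routine_data_stall": os_data_stall * HOT_ROUTINE_DATA_STALL_SHARE,
    }


if __name__ == "__main__":
    for name, fraction in stall_breakdown().items():
        print(f"{name}: {fraction:.1%} of total execution time")
```

Under these assumptions, OS memory system stalls come to about 19% of total execution time (0.24 × 0.79 = 0.1896), or roughly 9.5% each for instruction and data misses. The 80% figure for locally serviced instruction misses is a share of miss count rather than stall time, so it is deliberately left out of this calculation.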