ATUM: a new technique for capturing address traces using microcode
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
An evaluation of directory schemes for cache coherence
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Blocking: exploiting spatial locality for trace compaction
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
The effect of context switches on cache performance
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The Stanford Dash Multiprocessor
Computer
Cache Invalidation Patterns in Shared-Memory Multiprocessors
IEEE Transactions on Computers
Characterizing the caching and synchronization performance of a multiprocessor operating system
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
The detection and elimination of useless misses in multiprocessors
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
The impact of operating system structure on memory system performance
SOSP '93 Proceedings of the fourteenth ACM symposium on Operating systems principles
Architectural support for performance tuning: a case study on the SPARCcenter 2000
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Software-extended coherent shared memory: performance and cost
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Scheduling and page migration for multiprocessor compute servers
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Generation and analysis of very long address traces
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The DASH Prototype: Logic Overhead and Performance
IEEE Transactions on Parallel and Distributed Systems
Scalable Reader-Writer Locks for Parallel Systems
IPPS '92 Proceedings of the 6th International Parallel Processing Symposium
The impact of architectural trends on operating system performance
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Using the SimOS machine simulator to study complex computer systems
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Searching for the sorting record: experiences in tuning NOW-Sort
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Tornado: maximizing locality and concurrency in a shared memory multiprocessor operating system
OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Comprehensive Hardware and Software Support for Operating Systems to Exploit MP Memory Hierarchies
IEEE Transactions on Computers
ADir_pNB: A Cost-Effective Way to Implement Full Map Directory-Based Cache Coherence Protocols
IEEE Transactions on Computers
Complete Computer System Simulation: The SimOS Approach
IEEE Parallel & Distributed Technology: Systems & Technology
Coherent Block Data Transfer in the FLASH Multiprocessor
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Performance Analysys of a CC-NUMAOperating System
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Boosting the Performance of Three-Tier Web Servers Deploying SMP Architecture
Revised Papers from the NETWORKING 2002 Workshops on Web Engineering and Peer-to-Peer Computing
Improving the Data Cache Performance of Multiprocessor Operating Systems
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Journal of Parallel and Distributed Computing
Experience distributing objects in an SMMP OS
ACM Transactions on Computer Systems (TOCS)
Speeding-up multiprocessors running DBMS workloads through coherence protocols
International Journal of High Performance Computing and Networking
Enhancing operating system support for multicore processors by using hardware performance monitoring
ACM SIGOPS Operating Systems Review
Locating cache performance bottlenecks using data profiling
Proceedings of the 5th European conference on Computer systems
Hi-index | 0.01 |
This study characterizes the performance of a variant of UNIX SVR4 on a large shared-memory multiprocessor and analyzes the effects of possible OS and architectural changes. We use a nonintrusive cache miss monitor to trace the execution of an OS-intensive multiprogrammed workload on the Stanford DASH, a 32-CPU CC-NUMA multiprocessor (CC-NUMA multiprocessors have cache-coherent shared memory that is physically distributed across the machine). We find that our version of UNIX accounts for 24% of the workload's total execution time. A surprisingly large fraction of OS time (79%) is spent on memory system stalls, divided equally between instruction and data cache miss time. In analyzing techniques to reduce instruction cache miss stall time, we find that replication of only 7% of the OS code would allow 80% of instruction cache misses to be serviced locally on a CC-NUMA machine. For data cache misses, we find that a small number of routines account for 96% of OS data cache stall time. We find that most of these misses are coherence (communication) misses, and larger caches will not necessarily help. After presenting detailed performance data, we analyze the benefits of several OS changes and predict the effects of altering the cache configuration, degree of clustering, and cache coherence mechanism of the machine. (This paper is available via http://wwwflash.stanford.edu.)