Improving the Data Cache Performance of Multiprocessor Operating Systems

Authors:
Chun Xia;Josep Torrellas
Affiliations:
-;-
Venue:
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Year:
1996

Citing 13
Cited 6

Cache coherence protocols: evaluation using a multiprocessor simulation model

ACM Transactions on Computer Systems (TOCS)
Cache performance of operating system and multiprogramming workloads

ACM Transactions on Computer Systems (TOCS)
The VMP multiprocessor: initial experience, refinements, and performance evaluation

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The interaction of architecture and operating system design

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Page placement algorithms for large real-indexed caches

ACM Transactions on Computer Systems (TOCS)
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Characterizing the caching and synchronization performance of a multiprocessor operating system

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
The impact of operating system structure on memory system performance

SOSP '93 Proceedings of the fourteenth ACM symposium on Operating systems principles
Contrasting characteristics and cache performance of technical and multi-user commercial workloads

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Avoiding conflict misses dynamically in large direct-mapped caches

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Memory system performance of UNIX on CC-NUMA multiprocessors

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Lockup-free instruction fetch/prefetch cache organization

ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Optimizing instruction cache performance for operating system intensive workloads

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture

Instruction prefetching of systems codes with layout optimized for reduced cache misses

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Comprehensive Hardware and Software Support for Operating Systems to Exploit MP Memory Hierarchies

IEEE Transactions on Computers
PSCR: A Coherence Protocol for Eliminating Passive Sharing in Shared-Bus Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
An analysis of operating system behavior on a simultaneous multithreaded architecture

ACM SIGPLAN Notices
An analysis of operating system behavior on a simultaneous multithreaded architecture

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
A combined DMA and application-specific prefetching approach for tackling the memory latency bottleneck

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Bus-based shared-memory multiprocessors with coherent caches have recently become very popular. To achieve high performance, these systems rely on increasingly sophisticated cache hierarchies. However, while these machines often run loads with substantial operating system activity, performance measurements have consistently indicated that the operating system uses the data cache hierarchy poorly. In this paper, we address the issue of how to eliminate most of the data cache misses in a multiprocessor operating system while still using off-the-shelf processors. We use a performance monitor to examine traces of a 4-processor machine running four system-intensive loads under UNIX. Based on our observations, we propose hardware and software support that targets block operations, coherence activity, and cache conflicts. For block operations, simple cache bypassing or prefetching schemes are undesirable. Instead, it is best to use a DMA-like scheme that pipelines the data transfer in the bus without involving the processor. Coherence misses are handled with data privatization and relocation, and the use of updates for a small core of shared variables. Finally, the remaining miss hot spots are handled with data prefetching. Overall, our simulations show that all these optimizations combined eliminate or hide 75% of the operating system data misses in 32-Kbyte primary caches. Furthermore, they speed up the operating system by 19%.