Optimizing instruction cache performance for operating system intensive workloads

Authors:
J. Torrellas; Chun Xia; R. Daigle
Venue:
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Year:
1995

Abstract

High instruction cache hit rates are key to high performance. One known technique for improving cache hit rates is to use an optimizing compiler to minimize cache interference through an improved code layout. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. It is therefore unknown how well existing optimizations perform for systems code and whether better optimizations can be found. This paper addresses that problem. It characterizes in detail the locality patterns of operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. As a result, interference within popular execution paths dominates instruction cache misses. Based on these observations, we propose an algorithm to expose these localities and reduce interference. For a range of cache sizes, associativities, line sizes, and other organizations, we show that the algorithm reduces total instruction miss rates by 31-86% (up to 2.9 absolute points). Using a simple model, this corresponds to execution time reductions on the order of 12-26%. In addition, our optimized operating system combines well with optimized or unoptimized applications.
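
The effect described in the abstract, where rarely executed special-case code interleaved with a popular execution path wastes cache space and causes interference, can be illustrated with a toy simulation. The sketch below is not the paper's algorithm or its experimental setup. It assumes a hypothetical direct-mapped cache (8 lines of 32 bytes), 4-byte instructions, and a hot path of 16 basic blocks of 16 bytes each. All names and parameters are made up for illustration, and the cold code is assumed never to execute in the trace.

```python
def count_misses(trace, num_lines, line_size):
    """Count misses of a direct-mapped cache over a trace of byte addresses."""
    resident = [None] * num_lines          # memory line currently held by each cache line
    misses = 0
    for addr in trace:
        mem_line = addr // line_size       # which memory line the address falls in
        index = mem_line % num_lines       # direct-mapped placement
        if resident[index] != mem_line:
            resident[index] = mem_line     # fetch the line, evicting the old one
            misses += 1
    return misses


def hot_path_trace(block_starts, block_size, instr_size, iterations):
    """Instruction fetch addresses for repeatedly executing the hot blocks in order."""
    one_pass = [start + offset
                for start in block_starts
                for offset in range(0, block_size, instr_size)]
    return one_pass * iterations


# Hypothetical parameters, chosen only to make the effect visible.
LINE_SIZE = 32     # bytes per cache line
NUM_LINES = 8      # 256-byte direct-mapped cache
BLOCK = 16         # bytes per basic block
INSTR = 4          # bytes per instruction
HOT_BLOCKS = 16    # blocks on the popular execution path
ITERATIONS = 100   # times the path is executed

# Original layout: each hot block is followed by 16 bytes of cold special-case code,
# so every cache line fetched for the hot path is half wasted.
interleaved_starts = [i * 2 * BLOCK for i in range(HOT_BLOCKS)]

# Rearranged layout: hot blocks are packed contiguously and cold code is moved elsewhere.
packed_starts = [i * BLOCK for i in range(HOT_BLOCKS)]

interleaved_misses = count_misses(
    hot_path_trace(interleaved_starts, BLOCK, INSTR, ITERATIONS), NUM_LINES, LINE_SIZE)
packed_misses = count_misses(
    hot_path_trace(packed_starts, BLOCK, INSTR, ITERATIONS), NUM_LINES, LINE_SIZE)

print("interleaved layout misses:", interleaved_misses)
print("packed layout misses:", packed_misses)
```

In this toy setup the interleaved hot path spans 16 memory lines that compete for 8 cache lines, so the path interferes with itself on every pass. The packed path fits in the cache and misses only on first touch. Real operating system code and the paper's measured workloads are far more complex, so the numbers here say nothing about the reported 31-86% reductions.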