Program optimization for instruction caches

Authors:
S. McFarling
Affiliations:
Computer Systems Laboratory, Stanford University
Venue:
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Year:
1989

Abstract

This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future. For best results, the cache should have a mechanism for excluding certain instructions designated by the compiler. This paper first presents a reduced form of the algorithm. This form is shown to produce an optimal miss rate for programs without conditionals and with a tree call graph, assuming basic blocks can be reordered at will. If conditionals are allowed, but there are no loops within conditionals, the algorithm does as well as an optimal cache for the worst-case execution of the program consistent with the profile information. Next, the algorithm is extended with heuristics for general programs. The effectiveness of these heuristics is demonstrated with empirical results for a set of 10 programs at various cache sizes. The improvement depends on cache size. For a 512-word cache, miss rates for a direct-mapped instruction cache are halved. For an 8K-word cache, miss rates fall by over 75%. Over a wide range of cache sizes, the algorithm is as effective as increasing the cache size by a factor of 3. For 512 words, the algorithm generates only 32% more misses than an optimal cache. Optimized programs on a direct-mapped cache have lower miss rates than unoptimized programs on set-associative caches of the same size.
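
The sensitivity of a direct-mapped cache to code placement, which the abstract relies on, can be illustrated with a small simulation. The sketch below is not the paper's algorithm. It is a toy model under stated assumptions: one-word cache lines, a hypothetical 8-word cache, and a hypothetical program that alternates between two 4-word blocks named A and B. The function names, layout dictionaries, and addresses are all invented for illustration. It only shows that moving one block so the two no longer map to the same cache lines removes the repeated conflict misses.

```python
def direct_mapped_misses(trace, cache_words):
    """Count misses for a word-address trace on a direct-mapped cache.

    Assumption: each cache line holds exactly one word, so the line
    index is simply the address modulo the cache size.
    """
    lines = [None] * cache_words
    misses = 0
    for addr in trace:
        index = addr % cache_words
        if lines[index] != addr:
            misses += 1
            lines[index] = addr  # the new word evicts whatever was there
    return misses


def make_trace(layout, schedule):
    """Expand a schedule of block names into a word-address trace.

    layout maps a block name to (start_address, length_in_words).
    """
    trace = []
    for name in schedule:
        start, length = layout[name]
        trace.extend(range(start, start + length))
    return trace


CACHE_WORDS = 8  # hypothetical cache size

# Hypothetical execution: blocks A and B run alternately, 10 times each.
schedule = ["A", "B"] * 10

# Layout 1: B starts exactly one cache size after A, so both blocks
# map to cache lines 0..3 and evict each other on every execution.
conflicting = {"A": (0, 4), "B": (8, 4)}

# Layout 2: B is repositioned so it maps to cache lines 4..7 instead.
repositioned = {"A": (0, 4), "B": (12, 4)}

misses_conflicting = direct_mapped_misses(
    make_trace(conflicting, schedule), CACHE_WORDS)
misses_repositioned = direct_mapped_misses(
    make_trace(repositioned, schedule), CACHE_WORDS)

print(misses_conflicting, misses_repositioned)
```

In this toy setup the conflicting layout misses on every one of the 80 references, while the repositioned layout misses only on the first execution of each block. A profile-driven optimizer of the kind the abstract describes would have to choose such placements automatically from measured execution counts, which this sketch does not attempt.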