Profile guided code positioning

Authors:
Karl Pettis; Robert C. Hansen
Affiliations:
Hewlett-Packard Company, California Language Laboratory, 19447 Pruneridge Avenue, Cupertino, California (both authors)
Venue:
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Year:
1990


Abstract

This paper presents the results of our investigation of code positioning techniques using execution profile data as input into the compilation process. The primary objective of the positioning is to reduce the overhead of the instruction memory hierarchy.

After initial investigation in the literature, we decided to implement two prototypes for the Hewlett-Packard Precision Architecture (PA-RISC). The first, built on top of the linker, positions code based on whole procedures. This prototype has the ability to move procedures into an order that is determined by a “closest is best” strategy.

The second prototype, built on top of an existing optimizer package, positions code based on basic blocks within procedures. Groups of basic blocks that would be better as straight-line sequences are identified as chains. These chains are then ordered according to branch heuristics. Code that is never executed during the data collection runs can be physically separated from the primary code of a procedure by a technique we devised called procedure splitting.

The algorithms we implemented are described through examples in this paper. The performance improvements from our work are also summarized in various tables and charts.
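The idea of grouping basic blocks into chains can be illustrated with a small sketch. The abstract does not spell out the procedure, so the code below is only one plausible greedy formulation and not necessarily the authors' algorithm: arcs are visited from heaviest to lightest, and two chains are joined when the arc runs from the tail of one chain to the head of another. The function name `form_chains`, the block names, and the profile counts are all hypothetical.

```python
from typing import Dict, List, Tuple

Arc = Tuple[str, str]  # (source block, destination block)


def form_chains(blocks: List[str], arcs: Dict[Arc, int]) -> List[List[str]]:
    """Greedily group basic blocks into straight-line chains.

    Assumption: arcs carry execution counts from a profile run, and
    heavier arcs are better candidates for fall-through placement.
    """
    # Every block starts out as a chain of its own.
    chain_of: Dict[str, List[str]] = {b: [b] for b in blocks}

    # Visit arcs from heaviest to lightest; ties are broken by arc name
    # so that the result is deterministic.
    for (src, dst), weight in sorted(arcs.items(), key=lambda kv: (-kv[1], kv[0])):
        if weight == 0:
            continue  # never-executed arcs give no reason to join blocks
        head_chain, tail_chain = chain_of[src], chain_of[dst]
        if head_chain is tail_chain:
            continue  # both blocks already sit in the same chain
        # Join only when src ends one chain and dst begins the other, so
        # the arc becomes a fall-through in the final layout.
        if head_chain[-1] != src or tail_chain[0] != dst:
            continue
        head_chain.extend(tail_chain)
        for blk in tail_chain:
            chain_of[blk] = head_chain

    # Collect the distinct chains in order of first appearance.
    seen = set()
    chains: List[List[str]] = []
    for blk in blocks:
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)
    return chains


# Hypothetical profile: the "hot" path dominates, "cold" is rarely taken.
example_blocks = ["entry", "test", "hot", "cold", "exit"]
example_arcs = {
    ("entry", "test"): 100,
    ("test", "hot"): 95,
    ("test", "cold"): 5,
    ("hot", "exit"): 95,
    ("cold", "exit"): 5,
}
print(form_chains(example_blocks, example_arcs))
```

In this hypothetical profile the frequently executed path ends up as a single straight-line chain, and the rarely executed block is left in a chain of its own. Ordering the resulting chains, and moving never-executed code out of the procedure, are separate steps that this sketch does not attempt.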