Reducing branch costs via branch alignment

Authors:
Brad Calder;Dirk Grunwald
Affiliations:
Department of Computer Science, Campus Box 430, University of Colorado, Boulder, CO, USA;Department of Computer Science, Campus Box 430, University of Colorado, Boulder, CO, USA
Venue:
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Year:
1994

Abstract

Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned with branch prediction reported small or minimal improvements. As wide-issue architectures become increasingly popular, the importance of reducing branch costs will increase, and branch alignment is one mechanism that can effectively reduce these costs.

In this paper, we propose an improved branch alignment algorithm that takes into consideration the architectural cost model and the branch prediction architecture when performing the basic block reordering. We show that branch alignment algorithms can improve a broad range of static and dynamic branch prediction architectures. We also show that program performance can be improved by approximately 5% even when using recently proposed, highly accurate branch prediction architectures. The programs are compiled by any existing compiler and then transformed via binary transformations. When implementing these algorithms on an Alpha AXP 21064, up to a 16% reduction in total execution time is achieved.
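To make the idea of basic block reordering concrete, the following is a minimal sketch of a generic greedy chaining heuristic driven by profile edge counts. It is not the algorithm proposed in the paper, which additionally weighs an architectural cost model and the branch prediction scheme. The function names (`greedy_align`, `taken_count`), the edge-count dictionary, and the use of "number of non-fall-through transfers" as a cost proxy are all assumptions made for illustration.

```python
def greedy_align(edges, entry):
    """Order basic blocks so that hot edges become fall-throughs.

    edges: dict mapping (src, dst) block pairs to profiled execution counts.
    entry: the block that must stay first in the layout.
    This is an illustrative heuristic, not the paper's algorithm.
    """
    next_block = {}  # src -> block placed immediately after it
    prev_block = {}  # dst -> block placed immediately before it

    # Remember blocks in first-seen order so the output is deterministic.
    blocks = [entry]
    for s, d in edges:
        for b in (s, d):
            if b not in blocks:
                blocks.append(b)

    def chain_head(b):
        # Walk backwards to the first block of the chain containing b.
        while b in prev_block:
            b = prev_block[b]
        return b

    # Consider the most frequently executed edges first.
    for (s, d), _count in sorted(edges.items(), key=lambda kv: -kv[1]):
        if s in next_block or d in prev_block:
            continue  # s already has a fall-through, or d already follows a block
        if d == entry or chain_head(s) == d:
            continue  # keep the entry first and avoid closing a chain into a cycle
        next_block[s] = d
        prev_block[d] = s

    # Emit the entry chain first, then the remaining chains.
    layout = []
    heads = [entry] + [b for b in blocks if b != entry and b not in prev_block]
    for head in heads:
        b = head
        while b is not None:
            layout.append(b)
            b = next_block.get(b)
    return layout


def taken_count(edges, layout):
    """Crude cost proxy: total count of transfers that are not fall-throughs."""
    pos = {b: i for i, b in enumerate(layout)}
    return sum(c for (s, d), c in edges.items() if pos[d] != pos[s] + 1)


# Hypothetical profile of a diamond-shaped control-flow graph.
edges = {("A", "B"): 10, ("A", "C"): 90, ("B", "D"): 10, ("C", "D"): 90}
original = ["A", "B", "C", "D"]
aligned = greedy_align(edges, "A")
print(aligned, taken_count(edges, original), taken_count(edges, aligned))
```

In this made-up profile the hot path A, C, D becomes straight-line code and the rarely executed block B is moved out of the way. A cost-model-aware variant would replace `taken_count` with per-outcome cycle costs that depend on the target's prediction hardware.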