Enhancing instruction scheduling with a block-structured ISA
International Journal of Parallel Programming
The case for a single-chip multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Value locality and load value prediction
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Exceeding the dataflow limit via value prediction
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences
Proceedings of the 24th annual international symposium on Computer architecture
Proceedings of the 24th annual international symposium on Computer architecture
Improving the accuracy and performance of memory communication through renaming
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The predictability of data values
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Highly accurate data value prediction using hybrid predictors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Advanced compiler design and implementation
Advanced compiler design and implementation
An empirical analysis of instruction repetition
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
The Superthreaded Processor Architecture
IEEE Transactions on Computers
Performance Study of a Concurrent Multithreaded Processor
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Exploiting Basic Block Value Locality with Block Reuse
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation
Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Application-driven processor design exploration for power-performance trade-off analysis
Proceedings of the 2001 IEEE/ACM international conference on Computer-aided design
A Prolog Tailoring Technique on an Epilog Tailored Procedure
PSI '02 Revised Papers from the 4th International Andrei Ershov Memorial Conference on Perspectives of System Informatics: Akademgorodok, Novosibirsk, Russia
Balancing Reuse Opportunities and Performance Gains with Subblock Value Reuse
IEEE Transactions on Computers
An embedded multi-resolution AMBA trace analyzer for microprocessor-based SoC integration
Proceedings of the 44th annual Design Automation Conference
Limits for a feasible speculative trace reuse implementation
International Journal of High Performance Systems Architecture
Window memoization: an efficient hardware architecture for high-performance image processing
Journal of Real-Time Image Processing
Hi-index | 14.98 |
Speculative execution and instruction reuse are two important strategies that have been investigated for improving processor performance. Value prediction at the instruction level has been introduced to allow even more aggressive speculation and reuse than previous techniques. This study suggests that using compiler support to extend value reuse to a coarser granularity than a single instruction, such as a basic block, may have substantial performance benefits. We investigate the input and output values of basic blocks and find that these values can be quite regular and predictable. For the SPEC benchmark programs evaluated, 90 percent of the basic blocks have fewer than four register inputs, five live register outputs, four memory inputs, and two memory outputs. About 16 to 41 percent of all the basic blocks are simply repeating earlier calculations when the programs are compiled with the -O2 optimization level in the GCC compiler. Compiler optimizations, such as loop-unrolling and function inlining, affect the sizes of basic blocks, but have no significant or consistent impact on their value locality, nor the resulting performance. Based on these results, we evaluate the potential benefit of basic block reuse using a novel mechanism called the block history buffer. This mechanism records input and live output values of basic blocks to provide value reuse at the basic block level. Simulation results show that using a reasonably sized block history buffer to provide basic block reuse in a 4-way issue superscalar processor can improve execution time for the tested SPEC programs by 1 to 14 percent, with an overall average of 9 percent when using reasonable hardware assumptions.