Improving performance of simple cores by exploiting loop-level parallelism through value prediction and reconfiguration

Authors:
Tameesh Suri;Aneesh Aggarwal
Affiliations:
State University of New York at Binghamton, Binghamton, NY, USA;State University of New York at Binghamton, Binghamton, NY, USA
Venue:
Proceedings of the 6th ACM conference on Computing frontiers
Year:
2009

Abstract

There is a growing trend towards designing simpler CPU cores, which offer considerable area, complexity, and power advantages. These cores are then used in large-scale multicore processors or in SoCs for hand-held devices. The most significant limitation of such simple cores is their lower performance. In this paper, we propose a technique to improve the performance of simple cores with minimal increase in complexity and area. In particular, we integrate a Reconfigurable Hardware Unit (RHU) that exploits loop-level parallelism to increase the core's overall performance. The RHU is reconfigured to execute instructions from future loop iterations whose operand values are highly predictable. Our experiments show that the proposed architecture improves performance by about 51% on average across a wide range of applications, while incurring an area overhead of only about 5.6%.
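
To give a feel for what "highly predictable operand values" can mean for a loop, the following is a minimal sketch of a stride-based value predictor with a saturating confidence counter. The abstract does not say which predictor the proposed architecture uses, so the class name `StridePredictor`, the helper `count_confident_hits`, and the threshold values are all assumptions made for illustration, not the authors' design.

```python
class StridePredictor:
    """Toy stride value predictor (illustrative assumption, not the paper's design)."""

    def __init__(self, threshold=2, max_conf=3):
        self.last = None            # last observed operand value
        self.stride = 0             # most recent difference between values
        self.conf = 0               # saturating confidence counter
        self.threshold = threshold  # minimum confidence needed to predict
        self.max_conf = max_conf    # saturation limit of the counter

    def predict(self):
        """Return the predicted next value, or None when not confident."""
        if self.last is None or self.conf < self.threshold:
            return None
        return self.last + self.stride

    def update(self, value):
        """Train on the actual value once it becomes known."""
        if self.last is not None:
            new_stride = value - self.last
            if new_stride == self.stride:
                # Same stride seen again: raise confidence up to the limit.
                self.conf = min(self.conf + 1, self.max_conf)
            else:
                # Stride changed: reset confidence and learn the new stride.
                self.conf = 0
                self.stride = new_stride
        self.last = value


def count_confident_hits(values):
    """Count values that were predicted confidently and correctly."""
    predictor = StridePredictor()
    hits = 0
    for v in values:
        if predictor.predict() == v:
            hits += 1
        predictor.update(v)
    return hits


# An array address advancing by 4 each iteration becomes predictable
# after a short warm-up; an irregular sequence never does.
regular_hits = count_confident_hits([0, 4, 8, 12, 16, 20])
irregular_hits = count_confident_hits([3, 1, 4, 1, 5, 9, 2, 6])
```

In this sketch, an operand such as a loop induction variable or a strided address is predicted only after its stride has repeated a few times. This mirrors the general idea that only confidently predictable values would be worth executing ahead from future iterations.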