Scalable multi-cores with improved per-core performance using off-the-critical path reconfigurable hardware

Authors:
Tameesh Suri;Aneesh Aggarwal
Affiliations:
Department of Electrical and Computer Engineering, State University of New York at Binghamton, Binghamton, NY;Department of Electrical and Computer Engineering, State University of New York at Binghamton, Binghamton, NY
Venue:
HiPC'08 Proceedings of the 15th international conference on High performance computing
Year:
2008

Abstract

Scaling the number of cores in a multi-core processor constrains the resources available in each core, resulting in reduced per-core performance. Alternatively, the number of cores has to be reduced in order to improve per-core performance. In this paper, we propose a technique to improve the per-core performance in a many-core processor without reducing the number of cores. In particular, we integrate a Reconfigurable Hardware Unit (RHU) in each core. The RHU executes the frequently encountered instructions to increase the core's overall execution bandwidth, thus improving its performance. We also propose a novel integrated hardware/software methodology for efficient RHU reconfiguration. The RHU has low area overhead, and hence has minimal impact on the scalability of the multi-core. Our experiments show that the proposed architecture improves the per-core performance by an average of about 12% across a wide range of applications, while incurring a per-core area overhead of only about 5%.