Federation: Boosting per-thread performance of throughput-oriented manycore architectures

Authors:
Michael Boyer; David Tarjan; Kevin Skadron
Affiliations:
University of Virginia, Charlottesville, VA; University of Virginia, Charlottesville, VA; University of Virginia, Charlottesville, VA
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2010

Abstract

Manycore architectures designed for parallel workloads are likely to use simple, highly multithreaded, in-order cores. This design maximizes throughput, but only when there are enough threads to keep the hardware utilized. For applications or phases with more limited parallelism, we describe how to create an out-of-order processor on the fly by federating two neighboring in-order cores. We reuse the large register file in the multithreaded cores to implement some out-of-order structures and reengineer other large, associative structures into simpler lookup tables. The resulting federated core provides twice the single-thread performance of the underlying in-order core, allowing the architecture to efficiently support a wider range of parallelism.