A Flexible Heterogeneous Multi-Core Architecture

Authors:
Miquel Pericas;Adrian Cristal;Francisco J. Cazorla;Ruben Gonzalez;Daniel A. Jimenez;Mateo Valero
Affiliations:
Universitat Politecnica de Catalunya, Spain/ Barcelona Supercomputing Center, Spain;Barcelona Supercomputing Center, Spain;Barcelona Supercomputing Center, Spain;Universitat Politecnica de Catalunya, Spain;The University of Texas at San Antonio, USA;Universitat Politecnica de Catalunya, Spain/ Barcelona Supercomputing Center, Spain
Venue:
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Year:
2007

Citing 0
Cited 12

Dma-based prefetching for i/o-intensive workloads on the cell architecture

Proceedings of the 5th conference on Computing frontiers
A Two-Level Load/Store Queue Based on Execution Locality

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Supporting MapReduce on large-scale asymmetric multi-core clusters

ACM SIGOPS Operating Systems Review
Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors

Proceedings of the 36th annual international symposium on Computer architecture
Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching

ACM Transactions on Architecture and Code Optimization (TACO)
Design of a cache hierarchy for LogN and LogN+1 model for multi-level cache system for multi-core processors

Proceedings of the 7th International Conference on Frontiers of Information Technology
Designing Accelerator-Based Distributed Systems for High Performance

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
A capabilities-aware framework for using computational accelerators in data-intensive computing

Journal of Parallel and Distributed Computing
Parallel pattern detection for architectural improvements

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Improving performance per watt of asymmetric multi-core processors via online program phase classification and adaptive core morphing

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special section on adaptive power management for energy and temperature-aware computing systems
An opportunistic prediction-based thread scheduling to maximize throughput/watt in AMPs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.00

Visualization

Abstract

Multi-core processors naturally exploit thread-level par- allelism (TLP). However, extracting instruction-level paral- lelism (ILP) from individual applications or threads is still a challenge as application mixes in this environment are nonuniform. Thus, multi-core processors should be flexi- ble enough to provide high throughput for uniform paral- lel applications as well as high performance for more gen- eral workloads. Heterogeneous architectures are a first step in this direction, but partitioning remains static and only roughly fits application requirements. This paper proposes the Flexible Heterogeneous Mul- tiCore processor (FMC), the first dynamic heterogeneous multi-core architecture capable of reconfiguring itself to fit application requirements without programmer intervention. The basic building block of this microarchitecture is a scal- able, variable-size window microarchitecture that exploits the concept of Execution Locality to provide large-window capabilities. This allows to overcome the memory wall for applications with high memory-level parallelism (MLP). The microarchitecture contains a set of small and fast cache processors that execute high locality code and a network of small in-order memory engines that together exploit low locality code. Single-threaded applications can use the entire network of cores while multi-threaded applications can effi- ciently share the resources. The sizing of critical structures remains small enough to handle current power envelopes. In single-threaded mode this processor is able to out- perform previous state-of-the-art high-performance proces- sor research by 12% on SpecFP. We show how in a quad- threaded/quad-core environment the processor outperforms a statically allocated configuration in both throughput and harmonic mean, two commonly used metrics to evaluate SMT performance, by around 2-4%. This is achieved while using a very simple sharing algorithm.