Balancing Programmability and Silicon Efficiency of Heterogeneous Multicore Architectures

Authors:
Andrei Terechko;Jan Hoogerbrugge;Ghiath Alkadi;Surendra Guntur;Anirban Lahiri;Marc Duranton;Clemens Wüst;Phillip Christie;Axel Nackaerts;Aatish Kumar
Affiliations:
NXP Semiconductors, The Netherlands;NXP Semiconductors, The Netherlands;NXP Semiconductors, The Netherlands;NXP Semiconductors, The Netherlands;NXP Semiconductors, The Netherlands;NXP Semiconductors, The Netherlands;NXP Semiconductors, The Netherlands;NXP Semiconductors, The Netherlands;NXP Semiconductors, The Netherlands;NXP Semiconductors, The Netherlands
Venue:
ACM Transactions on Embedded Computing Systems (TECS)
Year:
2012

Abstract

Multicore architectures provide scalable performance with a lower hardware design effort than single-core processors. Our article presents a design methodology and an embedded multicore architecture, focusing on reducing software design complexity and boosting performance density. First, we analyze the characteristics of task-level parallelism in modern multimedia workloads. These characteristics are used to formulate requirements for the programming model. We then translate the programming model requirements into an architecture specification, including a novel low-complexity implementation of cache coherence and a hardware synchronization unit. Our evaluation demonstrates that the novel coherence mechanism substantially simplifies hardware design while reducing performance by less than 18% relative to a complex snooping technique. Compared to a single processor core, multicores have already proven to be more area- and energy-efficient. However, multicore architectures in embedded systems still compete with highly efficient function-specific hardware accelerators. In this article we identify five architectural methods to boost the performance density of multicores: microarchitectural downscaling, asymmetric multicore architectures, multithreading, generic accelerators, and conjoining. We then present a novel methodology for exploring multicore design spaces, including the architectural methods that improve performance density. The methodology is based on a complex formula that computes the performance of heterogeneous multicore systems. Using this design space exploration methodology for HD and QuadHD H.264 video decoding, we estimate that the required multicore areas in 45 nm CMOS are 2.5 mm² and 8.6 mm², respectively. These results suggest that heterogeneous multicores are cost-effective for embedded applications and can provide good programmability support.