Cache coherence protocols: evaluation using a multiprocessor simulation model
ACM Transactions on Computer Systems (TOCS)
An evaluation of directory schemes for cache coherence
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
ACM SIGARCH Computer Architecture News
Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
An evaluation of memory consistency models for shared-memory systems with ILP processors
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Programming with POSIX threads
Programming with POSIX threads
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Wattch: a framework for architectural-level power analysis and optimizations
Proceedings of the 27th annual international symposium on Computer architecture
Eager writeback - a technique for improving bandwidth utilization
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Foundations of Parallel and Distributed Programming
Foundations of Parallel and Distributed Programming
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
Portable Programs for Parallel Processors
Portable Programs for Parallel Processors
A survey of processors with explicit multithreading
ACM Computing Surveys (CSUR)
Implementing Multithreaded Protocols for Release Consistency on Top of the Generic DSM-PM Platform
IWCC '01 Proceedings of the NATO Advanced Research Workshop on Advanced Environments, Tools, and Applications for Cluster Computing-Revised Papers
The SPMD Model: Past, Present and Future
Proceedings of the 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
DSPxPlore: design space exploration methodology for an embedded DSP core
Proceedings of the 2004 ACM symposium on Applied computing
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Core architecture optimization for heterogeneous chip multiprocessors
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
ALP: Efficient support for all levels of parallelism for complex media applications
ACM Transactions on Architecture and Code Optimization (TACO)
Proceedings of the 44th annual Design Automation Conference
How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs
IEEE Transactions on Computers
The case for simple, visible cache coherency
Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness: held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)
Synchroscalar: Evaluation of an embedded, multi-core architecture for media applications
Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
Amdahl's Law in the Multicore Era
Computer
Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A Look-Ahead Task Management Unit for Embedded Multi-Core Architectures
DSD '08 Proceedings of the 2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools
A Hardware Task Scheduler for Embedded Video Processing
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Finding Stress Patterns in Microprocessor Workloads
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Overlay techniques for scratchpad memories in low power embedded processors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Hi-index | 0.00 |
Multicore architectures provide scalable performance with a lower hardware design effort than single core processors. Our article presents a design methodology and an embedded multicore architecture, focusing on reducing the software design complexity and boosting the performance density. First, we analyze characteristics of the Task-Level Parallelism in modern multimedia workloads. These characteristics are used to formulate requirements for the programming model. Then we translate the programming model requirements to an architecture specification, including a novel low-complexity implementation of cache coherence and a hardware synchronization unit. Our evaluation demonstrates that the novel coherence mechanism substantially simplifies hardware design, while reducing the performance by less than 18% relative to a complex snooping technique. Compared to a single processor core, the multicores have already proven to be more area- and energy-efficient. However, the multicore architectures in embedded systems still compete with highly efficient function-specific hardware accelerators. In this article we identify five architectural methods to boost performance density of multicores; microarchitectural downscaling, asymmetric multicore architectures, multithreading, generic accelerators, and conjoining. Then, we present a novel methodology to explore multicore design spaces, including the architectural methods improving the performance density. The methodology is based on a complex formula computing performances of heterogeneous multicore systems. Using this design space exploration methodology for HD and QuadHD H.264 video decoding, we estimate that the required areas of multicores in CMOS 45 nm are 2.5 mm2 and 8.6 mm2, respectively. These results suggest that heterogeneous multicores are cost-effective for embedded applications and can provide a good programmability support.