Apple-CORE: Harnessing general-purpose many-cores with hardware concurrency management

Authors:
R. Poss;M. Lankamp;Q. Yang;J. Fu;M. W. Van Tol;I. Uddin;C. Jesshope
Venue:
Microprocessors & Microsystems
Year:
2013


Abstract

To harness the potential of chip multiprocessors (CMPs) for scalable, energy-efficient performance in general-purpose computers, the Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, with the aim of using parallelism efficiently in general-purpose workloads. Its hardware implementation provides logic that coordinates single-issue, in-order, multi-threaded RISC cores into on-chip computation clusters called Microgrids. In contrast with the traditional "accelerator" approach, Microgrids are components in distributed systems on chip that treat both clusters of small cores and optional, larger sequential cores as system services shared between applications. The key aspects of the design are asynchrony (the ability to tolerate irregular, long latencies on chip), a scale-invariant programming model, a distributed chip resource model, and the transparent performance scaling of a single program binary across multiple cluster sizes. This article describes the execution model, the core micro-architecture, its realization in a many-core, general-purpose processor chip, and its software environment. It also presents cycle-accurate simulation results for several key algorithmic and cryptographic kernels. The results show good hardware utilisation despite high-latency memory accesses, and good scalability across relatively large clusters of cores.