Apple-CORE: Microgrids of SVP Cores -- Flexible, General-Purpose, Fine-Grained Hardware Concurrency Management

  • Authors:
  • Raphael Poss;Mike Lankamp;Qiang Yang;Jian Fu;Michiel W. van Tol;Chris Jesshope

  • Venue:
  • DSD '12 Proceedings of the 2012 15th Euromicro Conference on Digital System Design
  • Year:
  • 2012

Abstract

To harness the potential of chip multiprocessors (CMPs) for scalable, energy-efficient performance in general-purpose computers, the Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, aiming at the efficient use of parallelism in general-purpose workloads. The corresponding hardware implementation provides logic able to coordinate single-issue, in-order multi-threaded RISC cores into on-chip computation clusters called Microgrids. In contrast with the traditional "accelerator" approach, Microgrids are intended to be used as components in distributed systems on chip that treat both clusters of small cores and optional larger cores optimized for sequential performance as system services shared between applications. The key aspects of the design are asynchrony (the ability to tolerate operations with irregular, long latencies), a scale-invariant programming model, a distributed view of the chip's structure, and the transparent performance scaling of a single program binary across multiple cluster sizes. This paper describes the execution model, the core micro-architecture, its realization in a many-core, general-purpose processor chip, and its software environment. The reference chip parameters include 128 cores, a 4 MB on-chip distributed cache network and four DDR3-1600 memory channels. The paper presents cycle-accurate simulation results for various key algorithmic and cryptographic kernels. The results show good hardware utilization despite high-latency memory accesses, and good scalability across relatively large clusters of cores.
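
To put the reference chip parameters in perspective, the following is a rough back-of-envelope sketch and not a figure taken from the paper. It assumes each DDR3-1600 channel uses the standard 64-bit (8-byte) data bus, that 1 MB means 2**20 bytes, and that cache and bandwidth are divided evenly across cores. The even split is purely illustrative, since the abstract describes the cache as a distributed network rather than per-core storage. All function and constant names are hypothetical.

```python
# Back-of-envelope figures derived from the reference chip parameters.
# Assumptions (not stated in the abstract): 64-bit DDR3 channels,
# 1 MB = 2**20 bytes, and an even split of resources across cores.

CORES = 128
CACHE_BYTES = 4 * 2**20            # 4 MB on-chip distributed cache
CHANNELS = 4                       # four DDR3-1600 memory channels
TRANSFERS_PER_S = 1600 * 10**6     # DDR3-1600: 1600 mega-transfers per second
BUS_BYTES = 8                      # assumed 64-bit data bus per channel


def peak_bandwidth_bytes_per_s() -> int:
    """Theoretical peak off-chip bandwidth over all channels."""
    return CHANNELS * TRANSFERS_PER_S * BUS_BYTES


def cache_bytes_per_core() -> int:
    """Cache capacity per core if divided evenly (illustrative only)."""
    return CACHE_BYTES // CORES


def peak_bandwidth_bytes_per_s_per_core() -> int:
    """Peak off-chip bandwidth per core if all cores stream at once."""
    return peak_bandwidth_bytes_per_s() // CORES


if __name__ == "__main__":
    print(peak_bandwidth_bytes_per_s() / 1e9, "GB/s peak in total")
    print(cache_bytes_per_core() // 1024, "KiB of cache per core")
    print(peak_bandwidth_bytes_per_s_per_core() / 1e6, "MB/s peak per core")
```

Under these assumptions each core has only a small share of cache and off-chip bandwidth, which is consistent with the abstract's emphasis on tolerating long-latency memory operations through multi-threading rather than relying on large per-core caches.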