Rigel: an architecture and scalable programming interface for a 1000-core accelerator

  • Authors:
  • John H. Kelm; Daniel R. Johnson; Matthew R. Johnson; Neal C. Crago; William Tuohy; Aqeel Mahesri; Steven S. Lumetta; Matthew I. Frank; Sanjay J. Patel

  • Affiliations:
  • University of Illinois, Urbana, IL, USA (all authors)

  • Venue:
  • Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA)
  • Year:
  • 2009

Quantified Score

Hi-index 0.02

Abstract

This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model in which parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications. We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² at 45 nm, which is comparable to high-end GPUs scaled to 45 nm. We perform experimental analysis of several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be built from flexible hardware primitives rather than dedicated special-purpose hardware.
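
To make the execution model concrete, the sketch below emulates the task-centric, bulk-synchronous SPMD pattern the abstract describes, using POSIX threads on a host CPU. It is an illustrative approximation only, not the Rigel low-level programming interface: all names (dequeue_task, core_main, the constants) and the use of a mutex-protected queue and pthread barrier are assumptions made for the example. Each "core" dynamically pulls tasks from one shared queue in a single address space, then meets the others at a barrier before the next interval of work, which is the same division of labor the paper argues can be supported without dedicated task or barrier hardware.

    /* Hypothetical sketch (not the actual Rigel API). Compile with: cc -pthread spmd.c */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_CORES     8     /* stand-in for Rigel's 1000+ cores */
    #define NUM_TASKS     64    /* tasks per bulk-synchronous interval */
    #define NUM_INTERVALS 3     /* intervals separated by global barriers */

    static pthread_mutex_t   queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t interval_barrier;
    static int               next_task;        /* head of the global task queue */
    static float             data[NUM_TASKS];  /* shared data, single global address space */

    /* Dequeue one task index, or -1 when this interval's queue is drained. */
    static int dequeue_task(void) {
        int t = -1;
        pthread_mutex_lock(&queue_lock);
        if (next_task < NUM_TASKS) t = next_task++;
        pthread_mutex_unlock(&queue_lock);
        return t;
    }

    static void *core_main(void *arg) {
        long core_id = (long)arg;
        for (int interval = 0; interval < NUM_INTERVALS; interval++) {
            int t;
            /* SPMD: every core runs the same loop, grabbing tasks dynamically
             * so work is load-balanced without static partitioning. */
            while ((t = dequeue_task()) >= 0)
                data[t] += 1.0f;                   /* placeholder task body */
            /* Bulk synchronization: wait until all cores finish the interval. */
            if (pthread_barrier_wait(&interval_barrier) == PTHREAD_BARRIER_SERIAL_THREAD) {
                next_task = 0;                     /* exactly one core refills the queue */
                printf("core %ld: interval %d complete\n", core_id, interval);
            }
            pthread_barrier_wait(&interval_barrier); /* make the refill visible to all */
        }
        return NULL;
    }

    int main(void) {
        pthread_t cores[NUM_CORES];
        pthread_barrier_init(&interval_barrier, NULL, NUM_CORES);
        for (long i = 0; i < NUM_CORES; i++)
            pthread_create(&cores[i], NULL, core_main, (void *)i);
        for (int i = 0; i < NUM_CORES; i++)
            pthread_join(cores[i], NULL);
        pthread_barrier_destroy(&interval_barrier);
        printf("data[0] = %.1f (expected %d.0)\n", data[0], NUM_INTERVALS);
        return 0;
    }

In this sketch the shared queue provides dynamic load balancing and the two barrier waits per interval provide the bulk synchronization; both are built from ordinary synchronization primitives, mirroring the abstract's claim that fast task distribution and barriers need not be implemented in specialized hardware.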