Rigel: an architecture and scalable programming interface for a 1000-core accelerator

Authors:
John H. Kelm; Daniel R. Johnson; Matthew R. Johnson; Neal C. Crago; William Tuohy; Aqeel Mahesri; Steven S. Lumetta; Matthew I. Frank; Sanjay J. Patel
Affiliations:
University of Illinois, Urbana, IL, USA (all authors)
Venue:
Proceedings of the 36th annual international symposium on Computer architecture
Year:
2009

Abstract

This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications. We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45 nm, which is comparable to high-end GPUs scaled to 45 nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load-balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.
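
To make the task-centric, bulk-synchronized style mentioned in the abstract more concrete, here is a minimal sketch in Python. It is not Rigel's programming interface: the names `run_bsp`, `sum_squares`, and `num_workers` are hypothetical, and host threads with a software queue and `threading.Barrier` stand in for cores and hardware primitives. It only illustrates the general pattern of workers dynamically pulling tasks from a shared queue and meeting at a barrier between intervals.

```python
import queue
import threading


def run_bsp(intervals, task_fn, num_workers=4):
    """Run task lists as bulk-synchronous intervals (illustrative only).

    intervals: list of task lists; one list per interval.
    task_fn:   function applied to each task.
    Returns one list of results per interval (order within a list varies).
    """
    # Pre-fill one shared queue per interval.
    queues = []
    for tasks in intervals:
        q = queue.Queue()
        for task in tasks:
            q.put(task)
        queues.append(q)

    results = [[] for _ in intervals]
    results_lock = threading.Lock()
    barrier = threading.Barrier(num_workers)

    def worker():
        for k, q in enumerate(queues):
            # Dynamic scheduling: an idle worker pulls the next task.
            while True:
                try:
                    task = q.get_nowait()
                except queue.Empty:
                    break
                out = task_fn(task)
                with results_lock:
                    results[k].append(out)
            # No worker starts interval k + 1 until all finish interval k.
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def sum_squares(bounds):
    """Example task: sum of i*i over the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))


if __name__ == "__main__":
    first = [(i, i + 100) for i in range(0, 1000, 100)]
    second = [(0, 10)]
    per_interval = run_bsp([first, second], sum_squares)
    print([sum(r) for r in per_interval])
```

The shared queue gives load balancing without a static assignment of tasks to workers, and the barrier is the only global synchronization point. A design at the scale the abstract describes would need hierarchical queues and barriers rather than a single lock-protected structure.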