on Parallel MIMD computation: HEP supercomputer and its applications
Scans as Primitive Parallel Operations
IEEE Transactions on Computers
A bridging model for parallel computation
Communications of the ACM
Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
A Hierarchical Task Queue Organization for Shared-Memory Multiprocessor Systems
IEEE Transactions on Parallel and Distributed Systems
The network architecture of the connection machine CM-5
Journal of Parallel and Distributed Computing
Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Synchronization and communication in the T3E multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
CEDAR: a large scale multiprocessor
ACM SIGARCH Computer Architecture News
Programmable Stream Processors
Computer
Chip multiprocessing and the cell broadband engine
Proceedings of the 3rd conference on Computing frontiers
Design tradeoffs for tiled CMP on-chip networks
Proceedings of the 20th annual international conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Carbon: architectural support for fine-grained parallelism on chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
Queue - GPU Computing
Scalable Parallel Programming with CUDA
Queue - GPU Computing
Accelerating advanced MRI reconstructions on GPUs
Journal of Parallel and Distributed Computing
Tradeoffs in designing accelerator architectures for visual computing
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Hierarchically tiled arrays for parallelism and locality
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Mesh-based many-core performance under process variations: a core yield perspective
ACM SIGARCH Computer Architecture News
Flexible architectural support for fine-grain scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An asymmetric distributed shared memory model for heterogeneous parallel systems
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Leakage-saving opportunities in mesh-based massive multi-core architectures
ACM SIGARCH Computer Architecture News
Cohesion: a hybrid memory model for accelerators
Proceedings of the 37th annual international symposium on Computer architecture
Relax: an architectural framework for software recovery of hardware faults
Proceedings of the 37th annual international symposium on Computer architecture
WAYPOINT: scaling coherence to thousand-core architectures
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
MEDICS: ultra-portable processing for medical image reconstruction
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
GoldMine: automatic assertion generation using data mining and static analysis
Proceedings of the Conference on Design, Automation and Test in Europe
OUTRIDER: efficient memory latency tolerance with decoupled strands
Proceedings of the 38th annual international symposium on Computer architecture
Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators
Proceedings of the 38th annual international symposium on Computer architecture
DreamWeaver: architectural support for deep sleep
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Implementing a GPU programming model on a Non-GPU accelerator architecture
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Why on-chip cache coherence is here to stay
Communications of the ACM
Proceedings of the 9th conference on Computing Frontiers
CRAW/P: a workload partition method for the efficient parallel simulation of manycores
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Exploring memory consistency for massively-threaded throughput-oriented processors
Proceedings of the 40th Annual International Symposium on Computer Architecture
Locality-aware task management for unstructured parallelism: a quantitative limit study
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators
ACM Transactions on Computer Systems (TOCS)
An energy and bandwidth efficient ray tracing architecture
Proceedings of the 5th High-Performance Graphics Conference
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
TornadoNoC: A lightweight and scalable on-chip network architecture for the many-core era
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.02 |
This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications. We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm2 in 45nm, which is comparable to high-end GPUs scaled to 45nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load-balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.