Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators

Authors:
Yunsup Lee; Rimas Avizienis; Alex Bishara; Richard Xia; Derek Lockhart; Christopher Batten; Krste Asanović
Affiliations:
University of California, Berkeley; University of California, Berkeley; Stanford University; University of California, Berkeley; Cornell University; Cornell University; University of California, Berkeley
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2013

Abstract

We present a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT) architectural design patterns. We introduce Maven, a new VT microarchitecture based on the traditional vector-SIMD microarchitecture, that is considerably simpler to implement and easier to program than previous VT designs. Using an extensive design-space exploration of full VLSI implementations of many accelerator design points, we evaluate the varying tradeoffs between programmability and implementation efficiency among the MIMD, vector-SIMD, and VT patterns on a workload of compiled microbenchmarks and application kernels. We find the vector cores provide greater efficiency than the MIMD cores, even on fairly irregular kernels. Our results suggest that the Maven VT microarchitecture is superior to the traditional vector-SIMD architecture, providing both greater efficiency and easier programmability.