Tradeoff between data-, instruction-, and thread-level parallelism in stream processors

Authors:
Jung Ho Ahn; Mattan Erez; William J. Dally
Affiliations:
Hewlett-Packard Laboratories, Palo Alto, California; University of Texas at Austin, Austin, Texas; Stanford University, Stanford, California
Venue:
Proceedings of the 21st annual international conference on Supercomputing
Year:
2007

Abstract

This paper explores the scalability of the Stream Processor architecture along the instruction-, data-, and thread-level parallelism dimensions. We develop detailed VLSI-cost and processor-performance models for a multi-threaded Stream Processor and evaluate the tradeoffs, in both functionality and hardware cost, of mechanisms that exploit the different types of parallelism. We show that the hardware overhead of supporting coarse-grained independent threads of control is 15% to 86%, depending on machine parameters. We also demonstrate that, for a set of numerical applications, the performance gains this support provides are of a smaller magnitude. We argue that for stream applications with scalable parallel algorithms, performance is not very sensitive to the control structures used, across a large range of area-efficient architectural choices. We evaluate the specific effects on performance of scaling along the different parallelism dimensions and explain the limitations of the ILP, DLP, and TLP hardware mechanisms.