Memory and control organizations of stream processors

Authors:
William J. Dally;Jung Ho Ahn
Affiliations:
Stanford University;Stanford University
Venue:
Memory and control organizations of stream processors
Year:
2007

Citing: 0
Cited by: 3

Tradeoff between data-, instruction-, and thread-level parallelism in stream processors
Proceedings of the 21st annual international conference on Supercomputing

Exploring the limits of GPGPU scheduling in control flow bound applications
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers

Tiled multi-core stream architecture
Transactions on High-Performance Embedded Architectures and Compilers IV

Abstract

The increasing importance of numerical applications and the properties of modern VLSI processes have led to a resurgence in the development of architectures with a large number of ALUs, multiple memory channels, and extensive support for parallelism. In particular, stream processors achieve area- and energy-efficient high performance by relying on the abundant parallelism, multiple levels of locality, and predictability of data accesses common to media, signal processing, and scientific application domains. This thesis explores the memory and control organizations of stream processors and extends them in search of memory-system structures and ALU control combinations that deliver better performance and broader applicability while using a similar amount of hardware resources. We first study the design space of streaming memory systems in light of the trends of modern DRAMs: increasing concurrency, latency, and sensitivity to access patterns. From a detailed performance analysis using benchmarks with various DRAM parameters and memory-system configurations, we identify read/write turnaround penalties and internal bank conflicts in memory-access threads as the most critical factors affecting performance. We then present hardware techniques developed to maximize sustained memory-system throughput. Because stream processors rely heavily on parallelism for high performance, operations that require serialization can significantly hurt performance. This can be observed in superposition-type updates and histogram computation, which suffer from the memory-collision problem. We introduce and detail scatter-add, the data-parallel form of the scalar fetch-and-op, which solves this problem by guaranteeing the atomicity of data accumulation within the memory system. We then explore the scalability of the stream processor architecture along the instruction-, data-, and thread-level parallelism dimensions. We develop VLSI cost and performance models for a multi-threaded processor in order to study the tradeoffs in functionality and cost of mechanisms that exploit the different types of parallelism. We evaluate the specific effects on performance of scaling along the different parallelism dimensions and explain the limitations of the ILP, DLP, and TLP hardware mechanisms.
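
The memory-collision problem and the scatter-add semantics described in the abstract can be illustrated with a small software model. The sketch below is not the thesis's hardware design; it is a minimal Python illustration, and the function names (`naive_scatter_update`, `scatter_add`, `histogram`) are hypothetical. It assumes a data-parallel machine in which all lanes gather before any lane scatters, so that lanes targeting the same address lose updates, and it contrasts that with a scatter-add in which every accumulation is applied atomically.

```python
def naive_scatter_update(mem, indices, values):
    """Model of a data-parallel gather / add / scatter sequence.

    Assumption: every lane reads its target word before any lane writes,
    so lanes that collide on the same address overwrite each other.
    """
    gathered = [mem[i] for i in indices]                 # all lanes read old values
    summed = [g + v for g, v in zip(gathered, values)]   # per-lane add
    out = list(mem)
    for i, s in zip(indices, summed):                    # colliding lanes overwrite
        out[i] = s
    return out


def scatter_add(mem, indices, values):
    """Model of scatter-add: each accumulation is applied atomically,
    so every update to a colliding address is preserved."""
    out = list(mem)
    for i, v in zip(indices, values):
        out[i] += v
    return out


def histogram(samples, num_bins):
    """Histogram built as a scatter-add of 1 into the bin of each sample."""
    return scatter_add([0] * num_bins, samples, [1] * len(samples))


if __name__ == "__main__":
    samples = [0, 2, 2, 1, 2, 0]
    print(naive_scatter_update([0, 0, 0], samples, [1] * len(samples)))  # [1, 1, 1]
    print(histogram(samples, 3))                                         # [2, 1, 3]
```

In this model the naive sequence undercounts every bin that receives more than one sample, which is the collision behavior the abstract attributes to superposition-type updates and histogram computation, while the scatter-add version produces the correct counts regardless of how many lanes target the same bin.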