Memory and control organizations of stream processors

  • Authors:
  • William J. Dally; Jung Ho Ahn

  • Affiliations:
  • Stanford University; Stanford University

  • Venue:
  • Memory and control organizations of stream processors
  • Year:
  • 2007

Abstract

The increasing importance of numerical applications and the properties of modern VLSI processes have led to a resurgence in the development of architectures with a large number of ALUs, multiple memory channels, and extensive support for parallelism. In particular, stream processors achieve area- and energy-efficient high performance by relying on the abundant parallelism, multiple levels of locality, and predictability of data accesses common to media, signal processing, and scientific application domains. This thesis explores the memory and control organizations of stream processors and extends them in search of memory-system structures and ALU control combinations that lead to better performance and broader applicability while using a similar amount of hardware resources. We first study the design space of streaming memory systems in light of the trends of modern DRAMs: increasing concurrency, latency, and sensitivity to access patterns. From a detailed performance analysis using benchmarks with various DRAM parameters and memory-system configurations, we identify read/write turnaround penalties and internal bank conflicts in memory-access threads as the most critical factors affecting performance. We then present hardware techniques developed to maximize the sustained memory-system throughput. Because stream processors rely heavily on parallelism for high performance, operations that require serialization can significantly hurt performance. This can be observed in superposition-type updates and histogram computation, which suffer from the memory collision problem. We introduce and detail scatter-add, the data-parallel form of the scalar fetch-and-op, which solves this problem by guaranteeing the atomicity of data accumulation within the memory system.
Finally, we explore the scalability of the stream processor architecture along the instruction-level (ILP), data-level (DLP), and thread-level (TLP) parallelism dimensions. We develop VLSI cost and performance models for a multi-threaded processor in order to study the tradeoffs in functionality and cost of mechanisms that exploit the different types of parallelism. We evaluate the specific effects on performance of scaling along the different parallelism dimensions and explain the limitations of the ILP, DLP, and TLP hardware mechanisms.
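
The memory collision problem and the scatter-add semantics mentioned in the abstract can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: the function names (naive_vector_update, scatter_add) and the histogram data are hypothetical, and the sequential loop stands in for the atomic accumulation that the thesis describes as a hardware mechanism. It does not reproduce the thesis's design.

```python
def naive_vector_update(memory, indices, values):
    """Gather, add, and scatter as three separate data-parallel steps.

    Every lane reads its old value before any lane writes, so lanes
    that share an index overwrite each other (the memory collision
    problem) and only one contribution per address survives.
    """
    result = list(memory)
    gathered = [result[i] for i in indices]               # gather
    summed = [g + v for g, v in zip(gathered, values)]    # add
    for i, s in zip(indices, summed):                     # scatter
        result[i] = s
    return result


def scatter_add(memory, indices, values):
    """Reference semantics of scatter-add (hypothetical software model).

    Each (index, value) pair is accumulated as one indivisible
    read-modify-write, so colliding indices add up instead of
    overwriting each other.
    """
    result = list(memory)
    for i, v in zip(indices, values):
        result[i] += v
    return result


# Histogram example: six samples fall into four bins, with collisions
# on bins 0 and 2.
samples = [0, 2, 2, 3, 2, 0]
ones = [1] * len(samples)
bins = [0, 0, 0, 0]

naive = naive_vector_update(bins, samples, ones)   # loses colliding updates
correct = scatter_add(bins, samples, ones)         # counts every sample
```

In this model the naive update yields [1, 0, 1, 1] while scatter_add yields the true histogram [2, 0, 3, 1]. The difference is exactly the set of colliding updates that serialization, or an atomic accumulate, has to preserve.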