Memory hierarchy design for stream computing

  • Authors:
  • William J. Dally; Nuwan S. Jayasena

  • Affiliations:
  • Stanford University; Stanford University

  • Venue:
  • Memory hierarchy design for stream computing
  • Year:
  • 2005

Abstract

Several classes of applications with abundant fine-grain parallelism, such as media and signal processing, graphics, and scientific computing, have become increasingly dominant consumers of computing resources. Prior research has shown that stream processors provide an energy-efficient, programmable approach to achieving high performance for these applications. However, given the strong compute capabilities of these processors, efficient utilization of bandwidth, particularly when accessing off-chip memory, is crucial to sustaining high performance. This thesis explores tradeoffs in, and techniques for, improving the efficiency of memory and bandwidth hierarchy utilization in stream processors. We first evaluate the appropriate granularity for expressing data-level parallelism (entire records or individual words) and show that record-granularity expression of parallelism leads to reduced intermediate state storage requirements and higher sustained bandwidths in modern memory systems. We also explore the effectiveness of software- and hardware-managed memories, and identify the relative merits of each type of memory within the context of stream computing. Software-managed memories are shown to efficiently support coarse-grain and producer-consumer data reuse, while hardware-managed memories are shown to effectively capture fine-grain and irregular temporal reuse.

We introduce three new techniques for improving the efficiency of off-chip memory bandwidth utilization. First, we propose a stream register file architecture that enables indexed, arbitrary access patterns, allowing a wider range of data reuse to be captured in on-chip, software-managed memory compared to current stream processors. We then introduce epoch-based cache invalidation, a technique that actively identifies and invalidates dead data, to improve the performance of hardware-managed caches for stream computing. Finally, we propose a hybrid bandwidth hierarchy that incorporates both hardware- and software-managed memory, and allows dynamic reallocation of capacity between these two types of memories to better cater to application requirements. Our analyses and evaluations show that these techniques not only provide performance improvements for existing streaming applications but also broaden the capabilities of stream processors, enabling new classes of applications to be executed efficiently.