Memory hierarchy design for stream computing

Authors: William J. Dally; Nuwan S. Jayasena
Affiliations: Stanford University; Stanford University
Venue: Memory hierarchy design for stream computing
Year: 2005

Citing: 0
Cited by: 11

Comparing memory systems for chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture

Hierarchical memory system design for a heterogeneous multi-core processor
Proceedings of the 2008 ACM symposium on Applied computing

Comparative evaluation of memory models for chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)

Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor
The Journal of Supercomputing

Implementation and evaluation of Jacobi iteration on the Imagine stream processor
HiPC'07 Proceedings of the 14th international conference on High performance computing

Exploiting the reuse supplied by loop-dependent stream references for stream processors
ACM Transactions on Architecture and Code Optimization (TACO)

Scientific computing applications on the Imagine stream processor
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture

Optimization and evaluating of StreamYGX2 on MASA stream processor
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture

Tiled multi-core stream architecture
Transactions on High-Performance Embedded Architectures and Compilers IV

Matrix-Based programming optimization for improving memory hierarchy performance on Imagine
ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications

Architecture-based optimization for mapping scientific applications to Imagine
ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications

Abstract

Several classes of applications with abundant fine-grain parallelism, such as media and signal processing, graphics, and scientific computing, have become increasingly dominant consumers of computing resources. Prior research has shown that stream processors provide an energy-efficient, programmable approach to achieving high performance for these applications. However, given the strong compute capabilities of these processors, efficient utilization of bandwidth, particularly when accessing off-chip memory, is crucial to sustaining high performance. This thesis explores tradeoffs in, and techniques for, improving the efficiency of memory and bandwidth hierarchy utilization in stream processors.

We first evaluate the appropriate granularity for expressing data-level parallelism (entire records or individual words) and show that record-granularity expression of parallelism leads to reduced intermediate state storage requirements and higher sustained bandwidths in modern memory systems. We also explore the effectiveness of software- and hardware-managed memories, and identify the relative merits of each type of memory within the context of stream computing. Software-managed memories are shown to efficiently support coarse-grain and producer-consumer data reuse, while hardware-managed memories are shown to effectively capture fine-grain and irregular temporal reuse.

We introduce three new techniques for improving the efficiency of off-chip memory bandwidth utilization. First, we propose a stream register file architecture that enables indexed, arbitrary access patterns, allowing a wider range of data reuse to be captured in on-chip, software-managed memory compared to current stream processors. We then introduce epoch-based cache invalidation, a technique that actively identifies and invalidates dead data, to improve the performance of hardware-managed caches for stream computing. Finally, we propose a hybrid bandwidth hierarchy that incorporates both hardware- and software-managed memory, and allows dynamic reallocation of capacity between these two types of memories to better cater to application requirements. Our analyses and evaluations show that these techniques not only provide performance improvements for existing streaming applications but also broaden the capabilities of stream processors, enabling new classes of applications to be executed efficiently.
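
The abstract gives no implementation detail for epoch-based cache invalidation, so the following is only a toy illustration of the general idea that dropping dead data early can protect live data in a hardware-managed cache. Everything in it is an assumption made for illustration and not the thesis's mechanism: the `EpochCache` class, the `scoped` flag that marks a line as dying with its epoch, the fully associative LRU policy, and the address names are all hypothetical.

```python
from collections import OrderedDict


class EpochCache:
    """Toy fully associative LRU cache with per-line epoch tags (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.epoch = 0
        # address -> epoch tag; None marks a line that is not tied to any epoch
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, addr, scoped=False):
        """Touch addr. scoped=True (assumed hint) ties the line to the current epoch."""
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        self.misses += 1
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[addr] = self.epoch if scoped else None
        return False

    def advance_epoch(self):
        """Start a new epoch and invalidate lines tagged with an earlier one."""
        self.epoch += 1
        dead = [a for a, tag in self.lines.items()
                if tag is not None and tag < self.epoch]
        for a in dead:
            del self.lines[a]
        return len(dead)


def run(invalidate):
    """Reused data (P*), short-lived data (T*), then new data (N*) and reuse of P*."""
    cache = EpochCache(capacity=4)
    for a in ("P0", "P1"):
        cache.access(a)
    for a in ("T0", "T1"):
        cache.access(a, scoped=True)
    if invalidate:
        cache.advance_epoch()  # T0 and T1 are dead from here on
    for a in ("N0", "N1", "P0", "P1"):
        cache.access(a)
    return cache.hits


if __name__ == "__main__":
    print("hits without invalidation:", run(False))
    print("hits with invalidation:", run(True))
```

In this small trace, plain LRU evicts the reused lines P0 and P1 to make room for new data because the dead lines T0 and T1 were touched more recently. Invalidating the dead lines at the epoch boundary frees their space instead, so the later accesses to P0 and P1 hit.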