Exploiting the reuse supplied by loop-dependent stream references for stream processors

Authors:
Xuejun Yang;Ying Zhang;Xicheng Lu;Jingling Xue;Ian Rogers;Gen Li;Guibin Wang;Xudong Fang
Affiliations:
National University of Defense Technology, China;National University of Defense Technology, China;National University of Defense Technology, China;The University of New South Wales, Sydney, Australia;The University of Manchester, Manchester, UK;National University of Defense Technology, Changsha, China;National University of Defense Technology, Changsha, China;National University of Defense Technology, Changsha, China
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2008

Abstract

Memory accesses limit the performance of stream processors. The number of memory accesses can be reduced by exploiting the reuse of data held in the Stream Register File (SRF), an on-chip, software-controlled store. Current stream compilers attempt to exploit reuse only for simple stream references, that is, those whose start and end are known. Compiler analyses developed outside the stream-processor setting do not directly extend to other, more complex stream references. In this article, we propose a transformation that automatically optimizes stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: lemmas identifying the reuse supplied by stream references, a new abstract representation called the Stream Reuse Graph (SRG) depicting the identified reuse, and the optimization of the SRG for our transformation. We exploit both the reuse between the whole sequences accessed by stream references and the reuse between partial sequences. In particular, partial reuse and its treatment are new and, to the best of our knowledge, have never appeared in scalar or vector processing. At the same time, reusing streams increases the pressure on the SRF, which raises the problem of deciding which reuse to exploit within the limited SRF capacity. We extend our analysis to address this problem. Finally, we implement our techniques in the StreamC/KernelC compiler, which has been optimized with the best existing compilation techniques for stream processors. Experimental results show speed-ups of 1.14 to 2.54 times across a range of benchmarks.
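The idea of reuse between loop-dependent stream references can be illustrated with a small sketch. It is not the article's transformation or its Stream Reuse Graph; it only counts memory words under an assumed, simplified model in which iteration i of a loop reads the stream window [i*stride, i*stride + length). The function names, the window model, and the single on-chip buffer standing in for the SRF are all hypothetical.

```python
def words_loaded_without_reuse(num_iters, length, stride):
    """Every iteration reloads its whole window from memory."""
    return num_iters * length


def words_loaded_with_reuse(num_iters, length, stride):
    """Keep the previous window on chip and load only the missing words.

    Assumed model: iteration i reads indices [i*stride, i*stride + length).
    Only the most recent window is kept resident, standing in for a
    capacity-limited on-chip store.
    """
    loaded = 0
    resident = set()  # indices currently held in the hypothetical buffer
    for i in range(num_iters):
        window = set(range(i * stride, i * stride + length))
        loaded += len(window - resident)  # words not already on chip
        resident = window                 # older data is evicted
    return loaded
```

With 4 iterations, a window length of 8, and a stride of 6, consecutive windows overlap by 2 words, so the reuse-aware count is 26 words instead of 32. A stride of 0 models reuse of a whole sequence (8 words loaded once), and a stride of at least the window length leaves nothing to reuse.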