Exploiting loop-dependent stream reuse for stream processors

Authors:
Xuejun Yang;Ying Zhang;Jingling Xue;Ian Rogers;Gen Li;Guibin Wang
Affiliations:
National University of Defence Technology, Changsha, China;National University of Defence Technology, Changsha, China;The University of New South Wales, Sydney, Australia;The University of Manchester, Manchester, United Kingdom;National University of Defence Technology, Changsha, China;National University of Defence Technology, Changsha, China
Venue:
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Year:
2008


Abstract

Memory access limits the performance of stream processors. Exploiting the reuse of data held in the Stream Register File (SRF), an on-chip store, reduces the number of memory accesses. Current stream compilers attempt reuse only for simple stream references, namely those whose start and end are known. Compiler analyses developed outside the stream-processor setting do not directly extend to other, more complex stream references. In this paper we propose a transformation that automatically optimizes stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: algorithms to recognize the reuse supplied by stream references, a new abstract representation called the Stream Reuse Graph (SRG) to depict the reuse, and the optimization of the SRG for the transformation. We exploit both the reuse between whole sequences accessed by stream references and the reuse between partial sequences. In particular, the problem of exploiting partial stream reuse has no counterpart in the traditional data reuse setting (for scalars and arrays). Finally, we have implemented our techniques using the StreamC/KernelC compiler for Imagine. Experimental results show speedups of 1.14 to 2.54 times on a range of typical stream processing application kernels.
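To make the idea of loop-dependent stream reuse concrete, the following is a minimal illustrative sketch, not the paper's algorithm and not the SRG. It assumes a hypothetical loop in which iteration i reads the stream a[i*stride : i*stride + length], so the start and end of the reference depend on the loop index. The function name words_loaded and its parameters are invented for this illustration, and the model ignores SRF capacity and only considers reuse between consecutive iterations.

```python
def words_loaded(n_iters, stride, length, reuse):
    """Count words fetched from off-chip memory into the SRF (toy model).

    Assumption: iteration i reads a[i*stride : i*stride + length], stride >= 0.
    With reuse=True, words still resident from the previous iteration's
    window are not fetched again.
    """
    loaded = 0
    prev_end = None  # one past the last word left in the SRF by the previous iteration
    for i in range(n_iters):
        start = i * stride
        end = start + length
        if reuse and prev_end is not None and start < prev_end:
            # The windows overlap: fetch only the words beyond the old window.
            # stride == 0 gives whole-sequence reuse; 0 < stride < length
            # gives partial reuse.
            loaded += max(0, end - prev_end)
        else:
            # No overlap (or reuse disabled): fetch the whole window.
            loaded += length
        prev_end = end
    return loaded


if __name__ == "__main__":
    for stride in (0, 2, 8):
        base = words_loaded(4, stride, 8, reuse=False)
        opt = words_loaded(4, stride, 8, reuse=True)
        print(f"stride={stride}: {base} words without reuse, {opt} with reuse")
```

Under these assumptions, a stride of 0 corresponds to reuse of a whole sequence, a stride smaller than the window length corresponds to partial reuse, and a stride equal to or larger than the window length leaves nothing to reuse.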