Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
IEEE Transactions on Computers
A quasi-minimal residual variant of the Bi-CGSTAB algorithm for nonsymmetric systems
SIAM Journal on Scientific Computing
Scalar replacement in the presence of conditional control flow
Software—Practice & Experience
Hitting the memory wall: implications of the obvious
ACM SIGARCH Computer Architecture News
Register promotion in C programs
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A new algorithm for scalar register promotion based on SSA form
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Register promotion by sparse partial redundancy elimination of loads and stores
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Stream processor architecture
Reuse-Driven Tiling for Data Locality
LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing
Media Processing Applications on the Imagine Stream Processor
ICCD '02 Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02)
ICCD '02 Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02)
Efficiently Computing Static Single Assignment Form and the Control Dependence Graph
Efficiently Computing Static Single Assignment Form and the Control Dependence Graph
Dependence analysis for subscripted variables and its application to program transformations
Dependence analysis for subscripted variables and its application to program transformations
A programming system for the imagine media processor
A programming system for the imagine media processor
Programmable Stream Processors
Computer
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
Merrimac: Supercomputing with Streams
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Stream Register Files with Indexed Access
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Data and Computation Transformations for Brook Streaming Applications on Multiprocessors
Proceedings of the International Symposium on Code Generation and Optimization
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Compiling for stream processing
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
The design space of data-parallel memory systems
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
A 64-bit stream processor architecture for scientific applications
Proceedings of the 34th annual international symposium on Computer architecture
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Exploiting the reuse supplied by loop-dependent stream references for stream processors
ACM Transactions on Architecture and Code Optimization (TACO)
Reuse-aware modulo scheduling for stream processors
Proceedings of the Conference on Design, Automation and Test in Europe
ACM Transactions on Architecture and Code Optimization (TACO)
Simulation-based evaluation of the Imagine stream processor with scientific programs
International Journal of High Performance Computing and Networking
Hi-index | 0.01 |
The memory access limits the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip storage, the number of memory accesses can be reduced. In current stream compilers reuse is only attempted for simple stream references, those whose start and end are known. Compiler analysis from outside of stream processors does not directly enable the consideration of other complex stream references. In this paper we propose a transformation to automatically optimize stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: algorithms to recognize the reuse supplied by stream references, a new abstract expression called the Stream Reuse Graph (SRG) to depict the reuse and the optimization of the SRG for the transformation. Both the reuse between whole sequences accessed by stream references and that between partial sequences are exploited in the paper. In particular, the problem of exploiting partial stream reuse does not have its parallel in the traditional data reuse exploitation setting (for scalars and arrays). Finally, we have implemented our techniques using the StreamC/KernelC compiler for Imagine. Experimental results show a resultant speedup of 1.14 to 2.54 times using a range of typical stream processing application kernels.