Exploiting the reuse supplied by loop-dependent stream references for stream processors

  • Authors:
  • Xuejun Yang;Ying Zhang;Xicheng Lu;Jingling Xue;Ian Rogers;Gen Li;Guibin Wang;Xudong Fang

  • Affiliations:
  • National University of Defense Technology, China;National University of Defense Technology, China;National University of Defense Technology, China;The University of New South Wales, Sydeny, Australia;The University of Manchester, Manchester, UK;National University of Defense Technology, Changsha, China;National University of Defense Technology, Changsha, China;National University of Defense Technology, Changsha, China

  • Venue:
  • ACM Transactions on Architecture and Code Optimization (TACO)
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Memory accesses limit the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip, software controlled storage, the number of memory accesses can be reduced. In current stream compilers, reuse exploitation is only attempted for simple stream references, those whose start and end are known. Compiler analysis, from outside of stream processors, does not directly enable the consideration of other more complex stream references. In this article, we propose a transformation to automatically optimize stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: lemmas identifying the reuse supplied by stream references, a new abstract representation called the Stream Reuse Graph (SRG) depicting the identified reuse, and the optimization of the SRG for our transformation. Both the reuse between the whole sequences accessed by stream references and between partial sequences is exploited in the article. In particular, partial reuse and its treatment are quite new and have never, to the best of our knowledge, appeared in scalar and vector processing. At the same time, reusing streams increases the pressure on the SRF, and this presents a problem of which reuse should be exploited within limited SRF capacity. We extend our analysis to achieve this objective. Finally, we implement our techniques based on the StreamC/KernelC compiler that has been optimized with the best existing compilation techniques for stream processors. Experimental results show a resultant speed-up of 1.14 to 2.54 times using a range of benchmarks.