Optimizing scientific application loops on stream processors

Authors:
Li Wang;Xuejun Yang;Jingling Xue;Yu Deng;Xiaobo Yan;Tao Tang;Quan Hoang Nguyen
Affiliations:
NUDT, Changsha, China;NUDT, Changsha, China;UNSW, Sydney, Australia;NUDT, Changsha, China;NUDT, Changsha, China;NUDT, Changsha, China;UNSW, Sydney, Australia
Venue:
Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Year:
2008

Abstract

This paper describes a graph-coloring compiler framework that allocates on-chip SRF (Stream Register File) storage to optimize scientific applications on stream processors. The framework first applies enabling optimizations such as loop unrolling to expose stream reuse and opportunities for maximizing parallelism, i.e., overlapping kernel execution with memory transfers. The three SRF management tasks are then solved in a unified manner via graph coloring: (1) placing streams in the SRF, (2) exploiting stream reuse, and (3) maximizing parallelism. We evaluate the compiler framework by running nine representative scientific computing kernels on our FT64 stream processor. Our preliminary results show that compiler management achieves an average speedup of 2.3x over First-Fit allocation. Compared with running the same benchmarks on Itanium 2, an average speedup of 2.1x is observed.
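
To make the idea of placing streams by graph coloring concrete, here is a minimal sketch and not the paper's algorithm. It assumes every stream needs one equally sized SRF slot and that two streams interfere when their live ranges overlap. The names `build_interference` and `greedy_color`, the sample live ranges, and the highest-degree-first greedy heuristic are all illustrative assumptions.

```python
from itertools import combinations


def build_interference(live_ranges):
    """Build an interference graph from inclusive (first_use, last_use) ranges.

    Assumption for illustration: two streams interfere exactly when their
    live ranges overlap, so they cannot share an SRF slot.
    """
    graph = {name: set() for name in live_ranges}
    for a, b in combinations(live_ranges, 2):
        (s1, e1), (s2, e2) = live_ranges[a], live_ranges[b]
        if s1 <= e2 and s2 <= e1:  # lifetimes overlap
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_color(graph):
    """Assign each stream the lowest slot index unused by its neighbours.

    Streams are visited highest degree first (ties broken by name); this is
    a generic greedy heuristic, not the allocator described in the paper.
    """
    colors = {}
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {colors[m] for m in graph[node] if m in colors}
        slot = 0
        while slot in used:
            slot += 1
        colors[node] = slot
    return colors


# Hypothetical streams with made-up live ranges (in kernel steps).
live_ranges = {"a": (0, 2), "b": (1, 3), "c": (3, 5), "d": (4, 6)}
graph = build_interference(live_ranges)
slots = greedy_color(graph)
print(slots)
```

In this toy input, streams whose lifetimes never overlap end up sharing a slot, which is the basic storage saving that coloring offers. A real SRF allocator would also have to handle streams of different sizes, reuse, and the overlap of kernel execution with memory transfers mentioned in the abstract.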