This paper describes a compiler for stream programs that efficiently schedules computational kernels and stream memory operations and allocates on-chip storage. Our compiler uses information about the program structure, together with estimates of kernel and memory operation execution times, for two purposes: to overlap kernel execution with memory transfers, maximizing performance, and to optimize the use of scarce on-chip memory, significantly reducing external memory bandwidth. Our compiler applies optimizations such as strip-mining, loop unrolling, and software pipelining at the level of kernels and stream memory operations. We evaluate the performance of our compiler on a suite of media and scientific benchmarks. Our results show that compiler management of on-chip storage reduces external memory bandwidth by 35% to 93% and reduces execution time by 23% to 72% compared to cache-like LRU management of the same storage. We show that strip-mining stream applications enables producer-consumer locality to be captured in on-chip storage, reducing external bandwidth by 50% to 80%. We also evaluate the sensitivity of performance to the scheduling methods used and to critical resources. Overall, our compiler is able to overlap memory operations and manage local storage so that 78% to 96% of program execution time is spent running computational kernels.
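As an illustration of why strip-mining can capture producer-consumer locality, the following is a minimal sketch, not the compiler described here. It assumes a hypothetical two-kernel pipeline (a producer that doubles each element and a consumer that adds one) and a toy cost model in which the intermediate stream must be written to external memory and read back whenever an input strip plus its intermediate strip exceed the on-chip capacity. The function name `run_pipeline`, the kernels, and the capacity figure are all assumptions chosen for the example.

```python
def run_pipeline(data, capacity, strip_len=None):
    """Run a producer kernel then a consumer kernel over `data`.

    Returns (output, traffic), where traffic counts words moved to or
    from external memory under a toy model (an assumption of this sketch):
    the intermediate stream spills off chip whenever the input strip and
    the intermediate strip together exceed `capacity` words.
    """
    n = len(data)
    # No strip length given: process the whole stream as a single strip.
    strip_len = n if strip_len is None else strip_len
    output, traffic = [], 0
    for start in range(0, n, strip_len):
        strip = data[start:start + strip_len]
        traffic += len(strip)                  # load input strip from memory
        mid = [2 * x for x in strip]           # hypothetical producer kernel
        if len(strip) + len(mid) > capacity:
            traffic += 2 * len(mid)            # spill: write out, read back
        output.extend(x + 1 for x in mid)      # hypothetical consumer kernel
        traffic += len(strip)                  # store output strip to memory
    return output, traffic


data = list(range(1024))
whole, traffic_whole = run_pipeline(data, capacity=256)
stripped, traffic_strip = run_pipeline(data, capacity=256, strip_len=128)
print(traffic_whole, traffic_strip)  # 4096 2048
```

Both runs compute the same output. In this toy model the strip-mined version moves half as many words, because each 128-element intermediate strip stays on chip between producer and consumer. The figure is a property of the two-kernel example, not a reproduction of the measurements reported above.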