Compiling for stream processing

Authors:
Abhishek Das;William J. Dally;Peter Mattson
Affiliations:
Stanford University;Stanford University;Stream Processors, Inc.
Venue:
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Year:
2006

Abstract

This paper describes a compiler for stream programs that efficiently schedules computational kernels and stream memory operations, and allocates on-chip storage. Our compiler uses information about the program structure and estimates of kernel and memory operation execution times to overlap kernel execution with memory transfers, maximizing performance, and to optimize use of scarce on-chip memory, significantly reducing external memory bandwidth. Our compiler applies optimizations such as strip-mining, loop unrolling, and software pipelining at the level of kernels and stream memory operations. We evaluate the performance of our compiler on a suite of media and scientific benchmarks. Our results show that compiler management of on-chip storage reduces external memory bandwidth by 35% to 93% and reduces execution time by 23% to 72% compared to cache-like LRU management of the same storage. We show that strip-mining stream applications enables producer-consumer locality to be captured in on-chip storage, reducing external bandwidth by 50% to 80%. We also evaluate the sensitivity of performance to the scheduling methods used and to critical resources. Overall, our compiler is able to overlap memory operations and manage local storage so that 78% to 96% of program execution time is spent running computational kernels.
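
The effect of strip-mining on producer-consumer locality can be illustrated with a toy traffic model. The sketch below is not the compiler described in the abstract. It assumes a hypothetical two-kernel pipeline (kernel A produces an intermediate stream that kernel B consumes), equal-length streams, and an on-chip store that must hold a batch's input and intermediate at the same time. The names external_traffic and run_pipeline, the capacity rule, and the placeholder kernel arithmetic are all assumptions made for illustration.

```python
def external_traffic(n, capacity, strip=None):
    """Count words moved to or from external memory in a toy model.

    Assumed model (not from the paper): kernel A maps input -> intermediate,
    kernel B maps intermediate -> output, and all streams have equal length.
    A batch needs on-chip room for its input and its intermediate together;
    the output is assumed to reuse the input's space.
    """
    batch = n if strip is None else strip
    traffic = 0
    for start in range(0, n, batch):
        length = min(batch, n - start)
        traffic += length              # load the input batch
        if 2 * length > capacity:      # intermediate does not fit on chip
            traffic += 2 * length      # spill: store it, then reload it
        traffic += length              # store the output batch
    return traffic


def run_pipeline(data, strip):
    """Run both kernels strip by strip; results keep the original order."""
    out = []
    for start in range(0, len(data), strip):
        chunk = data[start:start + strip]
        mid = [x * 2 for x in chunk]       # kernel A (placeholder arithmetic)
        out.extend(y + 1 for y in mid)     # kernel B (placeholder arithmetic)
    return out


if __name__ == "__main__":
    n, capacity = 1000, 256
    print("unstripped traffic:", external_traffic(n, capacity))
    print("strip-mined traffic:", external_traffic(n, capacity, strip=128))
    print("sample output:", run_pipeline(list(range(6)), strip=4))
```

In this model the unstripped run spills the intermediate stream because the whole working set exceeds the assumed capacity, while strips of 128 elements keep the intermediate on chip, so only the input loads and output stores reach external memory. The size of the saving is a property of this toy model and its parameters, not a reproduction of the measurements reported in the abstract.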