Cache aware optimization of stream programs

Authors:
Janis Sermulins; William Thies; Rodric Rabbah; Saman Amarasinghe
Affiliations:
Massachusetts Institute of Technology; Massachusetts Institute of Technology; Massachusetts Institute of Technology; Massachusetts Institute of Technology
Venue:
LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Year:
2005

Abstract

Effective use of the memory hierarchy is critical for achieving high performance on embedded systems. We focus on the class of streaming applications, which is increasingly prevalent in the embedded domain. We exploit the widespread parallelism and regular communication patterns in stream programs to formulate a set of cache aware optimizations that automatically improve instruction and data locality. Our work is in the context of the Synchronous Dataflow model, in which a program is described as a graph of independent actors that communicate over channels. The communication rates between actors are known at compile time, allowing the compiler to statically model the caching behavior.

We present three cache aware optimizations: 1) execution scaling, which judiciously repeats actor executions to improve instruction locality, 2) cache aware fusion, which combines adjacent actors while respecting instruction cache constraints, and 3) scalar replacement, which converts certain data buffers into a sequence of scalar variables that can be register allocated. The optimizations are founded upon a simple and intuitive model that quantifies the temporal locality for a sequence of actor executions. Our implementation of cache aware optimizations in the StreamIt compiler yields a 249% average speedup (over unoptimized code) for our streaming benchmark suite on a StrongARM 1110 processor. The optimizations also yield a 154% speedup on a Pentium 3 and a 152% speedup on an Itanium 2.
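
To make the idea of execution scaling concrete, the following is a minimal sketch in Python. It is not the paper's cache model or the StreamIt compiler's implementation: the function names `count_code_reloads` and `scale`, the actor code sizes, and the cache capacity are all hypothetical, and the reuse-distance rule used here is an assumed simplification. It only illustrates why running each actor several times back to back can reduce instruction cache reloads when the actors' code does not fit in the cache together.

```python
from typing import Dict, List


def count_code_reloads(schedule: List[str],
                       code_size: Dict[str, int],
                       icache: int) -> int:
    """Count actor executions whose code must be (re)loaded.

    Toy reuse-distance rule (an assumption, not the paper's model):
    an execution hits only if the distinct actors run since the same
    actor's previous execution, itself included, fit in the
    instruction cache together.
    """
    reloads = 0
    last_seen: Dict[str, int] = {}
    for i, actor in enumerate(schedule):
        if actor not in last_seen:
            reloads += 1  # first execution: cold miss
        else:
            # Actors executed from the previous run of `actor` up to now.
            window = set(schedule[last_seen[actor]:i])
            if sum(code_size[a] for a in window) > icache:
                reloads += 1  # code was evicted in between
        last_seen[actor] = i
    return reloads


def scale(steady_state: List[str], m: int) -> List[str]:
    """Execution scaling: run each actor m times consecutively."""
    return [actor for actor in steady_state for _ in range(m)]


# Hypothetical pipeline A -> B -> C with unit rates; sizes in KB.
code_size = {"A": 6, "B": 6, "C": 6}
icache = 8

unscaled = ["A", "B", "C"] * 4          # steady state repeated 4 times
scaled = scale(["A", "B", "C"], 4)      # same work, each actor run 4x in a row

unscaled_reloads = count_code_reloads(unscaled, code_size, icache)
scaled_reloads = count_code_reloads(scaled, code_size, icache)
print(unscaled_reloads, scaled_reloads)
```

Both schedules perform the same twelve actor executions. In the scaled one, each channel must buffer four items instead of one, so a larger scaling factor trades instruction locality against data locality, which is presumably why the abstract describes the repetition as judicious.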