Static scheduling of synchronous data flow programs for digital signal processing
IEEE Transactions on Computers
The ESTEREL synchronous programming language: design, semantics, implementation
Science of Computer Programming
The SpectrumWare approach to wireless signal processing
Wireless Networks
POPL '96 Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Software Synthesis from Dataflow Graphs
Software Synthesis from Dataflow Graphs
Synthesis of Embedded Software from Synchronous Dataflow Specifications
Journal of VLSI Signal Processing Systems
StreamIt: A Language for Streaming Applications
CC '02 Proceedings of the 11th International Conference on Compiler Construction
Data-Flow Synchronous Languages
A Decade of Concurrency, Reflections and Perspectives, REX School/Symposium
Phased scheduling of stream programs
Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
The Sisal Model of Functional Programming and its Implementation
PAS '97 Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms / Architecture Synthesis
A Buffer Merging Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications
Proceedings of the 12th international symposium on System synthesis
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Cache aware mapping of streaming applications on a multiprocessor system-on-chip
Proceedings of the conference on Design, automation and test in Europe
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor
The Journal of Supercomputing
MPSoC Design Using Application-Specific Architecturally Visible Communication
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Synergistic execution of stream programs on multicores with accelerators
Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
The canals language and its compiler
Proceedings of th 12th International Workshop on Software and Compilers for Embedded Systems
Instruction Hints for Super Efficient Data Caches
ICCS 2009 Proceedings of the 9th International Conference on Computational Science
SARA: StreAm register allocation
CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Buffer sharing in CSP-like programs
MEMOCODE'09 Proceedings of the 7th IEEE/ACM international conference on Formal Methods and Models for Codesign
ICA3PP'07 Proceedings of the 7th international conference on Algorithms and architectures for parallel processing
Compiler assisted elliptic curve cryptography
OTM'07 Proceedings of the 2007 OTM confederated international conference on On the move to meaningful internet systems: CoopIS, DOA, ODBASE, GADA, and IS - Volume Part II
Buffer sharing in rendezvous programs
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems - Special section on the ACM IEEE international conference on formal methods and models for codesign (MEMOCODE) 2009
MPEG-2 decoding in a stream programming language
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Application-Tailored I/O with Streamline
ACM Transactions on Computer Systems (TOCS)
Cache-conscious scheduling of streaming applications
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
StreamPI: a stream-parallel programming extension for object-oriented programming languages
The Journal of Supercomputing
High-level support for pipeline parallelism on many-core architectures
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
StreamTMC: Stream compilation for tiled multi-core architectures
Journal of Parallel and Distributed Computing
Kernel Partitioning of Streaming Applications: A Statistical Approach to an NP-complete Problem
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Locality-aware task management for unstructured parallelism: a quantitative limit study
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Dynamic expressivity with static optimization for streaming languages
Proceedings of the 7th ACM international conference on Distributed event-based systems
Tutorial: stream processing optimizations
Proceedings of the 7th ACM international conference on Distributed event-based systems
11 PFLOP/s simulations of cloud cavitation collapse
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Maximum-throughput mapping of SDFGs on multi-core SoC platforms
Journal of Parallel and Distributed Computing
A catalog of stream processing optimizations
ACM Computing Surveys (CSUR)
Hi-index | 0.01 |
Effective use of the memory hierarchy is critical for achieving high performance on embedded systems. We focus on the class of streaming applications, which is increasingly prevalent in the embedded domain. We exploit the widespread parallelism and regular communication patterns in stream programs to formulate a set of cache aware optimizations that automatically improve instruction and data locality. Our work is in the context of the Synchronous Dataflow model, in which a program is described as a graph of independent actors that communicate over channels. The communication rates between actors are known at compile time, allowing the compiler to statically model the caching behavior.We present three cache aware optimizations: 1) execution scaling, which judiciously repeats actor executions to improve instruction locality, 2) cache aware fusion, which combines adjacent actors while respecting instruction cache constraints, and 3) scalar replacement, which converts certain data buffers into a sequence of scalar variables that can be register allocated. The optimizations are founded upon a simple and intuitive model that quantifies the temporal locality for a sequence of actor executions. Our implementation of cache aware optimizations in the StreamIt compiler yields a 249% average speedup (over unoptimized code) for our streaming benchmark suite on a StrongARM 1110 processor. The optimizations also yield a 154% speedup on a Pentium 3 and a 152% speedup on an Itanium 2.