Cache-conscious scheduling of streaming applications

Authors:
Kunal Agrawal;Jeremy T. Fineman;Jordan Krage;Charles E. Leiserson;Sivan Toledo
Affiliations:
Washington University in Saint Louis, Saint Louis, MO, USA;Georgetown University, Washington, D.C., USA;Washington University in Saint Louis, Saint Louis, MO, USA;Massachusetts Institute of Technology, Cambridge, MA, USA;Tel-Aviv University, Tel-Aviv, Israel
Venue:
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Year:
2012


Abstract

This paper considers the problem of scheduling streaming applications on uniprocessors so as to minimize the number of cache misses. A streaming application is represented as a directed graph (or multigraph) whose nodes are computation modules and whose edges are channels. When a module fires, it consumes some data items from its input channels and produces some items on its output channels. In addition, each module may have some state (either code or data), which represents the memory locations that must be loaded into cache in order to execute the module. We consider synchronous dataflow graphs, where the input and output rates of modules are known in advance and do not change during execution. We also assume that the state size of each module is known in advance.

Our main contribution is to show that, for a large and important class of streaming computations, cache-efficient scheduling is essentially equivalent to solving a constrained graph partitioning problem. A streaming computation from this class has a cache-efficient schedule if and only if its graph has a low-bandwidth partition of the modules into components (subgraphs), each of whose total state fits within the cache, where the bandwidth of the partition is the number of data items that cross intercomponent channels per data item that enters the graph.

Given a good partition, we describe a runtime strategy for scheduling two classes of streaming graphs: pipelines, where the graph consists of a single directed chain, and a fairly general class of directed acyclic graphs (dags) with some additional restrictions. The runtime scheduling strategy consists of adding large external buffers at the input and output edges of each component, allowing each component to be executed many times. Partitioning enables a reduction in cache misses in two ways. First, any items that are generated on edges internal to subgraphs are never written out to memory, but remain in cache. Second, each subgraph is executed many times, allowing the state to be reused. We prove the optimality of this runtime scheduling for all pipelines and for dags that meet certain conditions on buffer-size requirements. Specifically, we show that with constant-factor memory augmentation, partitioning on these graphs guarantees the optimal number of cache misses to within a constant factor. For the pipeline case, we also prove that such a partition can be found in polynomial time. For the dags we prove optimality if a good partition is provided; the partitioning problem itself is NP-complete.
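
To make the partitioning objective for pipelines concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not spell out. It assumes that each component is a contiguous segment of the chain, that a component is feasible when the sum of its modules' state sizes is at most the cache size, and that the per-channel traffic (items crossing the channel per item entering the pipeline) is given. The names `partition_pipeline`, `state`, `edge_rate`, and `cache` are hypothetical.

```python
from typing import List, Tuple


def partition_pipeline(
    state: List[int], edge_rate: List[float], cache: int
) -> Tuple[float, List[int]]:
    """Minimum-bandwidth partition of a chain into cache-sized segments.

    state[i]     -- state size of module i (assumed known in advance)
    edge_rate[i] -- items crossing the channel between modules i and i+1
                    per item entering the pipeline (assumed given)
    cache        -- cache size, in the same units as state

    Returns (bandwidth, cuts), where cuts lists the indices of the
    channels that cross between components.
    """
    n = len(state)
    if len(edge_rate) != max(n - 1, 0):
        raise ValueError("need exactly one rate per channel")
    if any(s > cache for s in state):
        raise ValueError("a single module's state exceeds the cache")

    inf = float("inf")
    # best[j]: least bandwidth of any feasible partition of modules 0..j-1.
    best = [inf] * (n + 1)
    # prev[j]: start index of the last segment in that partition.
    prev = [0] * (n + 1)
    best[0] = 0.0

    for j in range(1, n + 1):
        total = 0
        # Grow the last segment leftwards: modules i..j-1.
        for i in range(j - 1, -1, -1):
            total += state[i]
            if total > cache:
                break  # the segment no longer fits in cache
            # Cutting before module i costs the traffic on channel i-1.
            cut_cost = edge_rate[i - 1] if i > 0 else 0.0
            if best[i] + cut_cost < best[j]:
                best[j] = best[i] + cut_cost
                prev[j] = i

    # Walk back through prev to recover the cut channels.
    cuts = []
    j = n
    while j > 0:
        i = prev[j]
        if i > 0:
            cuts.append(i - 1)
        j = i
    cuts.reverse()
    return best[n], cuts


bandwidth, cuts = partition_pipeline([4, 3, 5, 2], [1.0, 0.5, 2.0], cache=8)
print(bandwidth, cuts)
```

In this example the four modules do not fit in a cache of size 8 together, so at least one channel must be cut. The sketch cuts only the lowest-traffic channel (index 1), giving components {0, 1} and {2, 3}. The dynamic program runs in O(n^2) time for a chain of n modules, which is consistent with, but not a restatement of, the polynomial-time claim for pipelines.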