Buffer sizing for self-timed stream programs on heterogeneous distributed memory multiprocessors

Authors:
Paul M. Carpenter; Alex Ramirez; Eduard Ayguadé
Affiliations:
Barcelona Supercomputing Center, Barcelona, Spain; Barcelona Supercomputing Center, Barcelona, Spain; Barcelona Supercomputing Center, Barcelona, Spain
Venue:
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Year:
2010

Abstract

Stream programming is a promising way to expose concurrency to the compiler. A stream program is built from kernels that communicate only via point-to-point streams. The stream compiler statically allocates these kernels to processors, applying blocking, fission and fusion transformations. The compiler also determines the sizes of the communication buffers, a choice that affects performance because local memories can be small. In this paper, we propose a feedback-directed algorithm that determines the size of each communication buffer, based on (i) the stream program that has been mapped onto processors, (ii) feedback from an earlier execution, and (iii) the memory constraints. The algorithm exposes a trade-off between throughput and latency. It is general, in that it applies to stream programs with unstructured stream graphs, and it supports variable execution times and communication rates. We show results for the StreamIt benchmarks and random graphs. For the StreamIt benchmarks, throughput is optimal after the first iteration. For random graphs with stochastic computation times, throughput is within 3% of optimal after four iterations. Compared with the previous general algorithm, by Basten and Hoogerbrugge, our algorithm achieves significantly better performance and latency.
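
The abstract names three inputs to the buffer-sizing step: the mapped stream program, feedback from an earlier execution, and per-processor memory constraints. The sketch below is only a toy illustration of how such inputs could drive one feedback iteration; it is not the algorithm proposed in the paper. The `Stream` record, the `stall_time` feedback metric, and the greedy rule (give the next buffer slot to the stream with the highest stall time per slot) are all assumptions made for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Stream:
    """Hypothetical record for one point-to-point stream after mapping."""
    name: str
    processor: str     # processor whose local memory holds the buffer (assumed)
    slot_bytes: int    # size of one buffer slot in bytes
    stall_time: float  # time spent blocked on this stream in the previous run (assumed feedback)
    slots: int = 1     # current buffer size, in slots


def resize_buffers(streams, mem_budget):
    """One feedback iteration of a toy greedy heuristic (not the paper's method).

    Repeatedly gives one extra slot to the stream with the largest stall time
    per slot, as long as the slot fits in that processor's memory budget.
    Returns a mapping from stream name to its new slot count.
    """
    # Memory still free on each processor after the current allocation.
    free = dict(mem_budget)
    for s in streams:
        free[s.processor] -= s.slots * s.slot_bytes
    if any(v < 0 for v in free.values()):
        raise ValueError("initial buffers exceed a memory budget")

    # Max-heap on stall time per slot (negated, since heapq is a min-heap).
    heap = [(-s.stall_time / s.slots, i)
            for i, s in enumerate(streams) if s.stall_time > 0]
    heapq.heapify(heap)

    while heap:
        _, i = heapq.heappop(heap)
        s = streams[i]
        if free[s.processor] < s.slot_bytes:
            continue  # no room left for this stream; stop considering it
        s.slots += 1
        free[s.processor] -= s.slot_bytes
        # Re-queue with a lower priority to model diminishing returns.
        heapq.heappush(heap, (-s.stall_time / s.slots, i))

    return {s.name: s.slots for s in streams}
```

In a feedback-directed setting, a function like this would be called once per iteration: run the program, measure the stalls, resize, and run again. The diminishing-returns priority is a design choice of this sketch, intended to keep a single heavily stalled stream from taking all of a processor's memory.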