On-the-fly pipeline parallelism

  • Authors:
  • I-Ting Angelina Lee;Charles E. Leiserson;Tao B. Schardl;Jim Sukha;Zhunping Zhang

  • Affiliations:
Massachusetts Institute of Technology CSAIL, Cambridge, MA, USA;Massachusetts Institute of Technology CSAIL, Cambridge, MA, USA;Massachusetts Institute of Technology CSAIL, Cambridge, MA, USA;Intel Corporation, Merrimack, NH, USA;Massachusetts Institute of Technology CSAIL, Cambridge, MA, USA

  • Venue:
  • Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
  • Year:
  • 2013

Abstract

Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with T_1 work and T_∞ span (critical-path length), Piper executes the computation on P processors in expected time T_P ≤ T_1/P + O(T_∞ + lg P). Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
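
To make the on-the-fly style concrete, the sketch below shows, in C, what such a pipeline loop can look like when stage boundaries are placed dynamically inside the loop body. The pipe_while, pipe_wait, and pipe_continue names echo the pipeline linguistics described in the abstract, but the macros here are illustrative serial-elision stand-ins assumed for this sketch, not the Cilk-P keywords or runtime; the only aim is to show the pipeline structure emerging during execution, with the number of stages varying from iteration to iteration.

    #include <stdio.h>

    /* Illustrative serial-elision stand-ins (assumptions for this sketch).
     * In an on-the-fly pipeline, iterations of the loop may overlap:
     * pipe_wait(s) advances the current iteration to stage s and waits on the
     * previous iteration's stage s, while pipe_continue(s) advances without a
     * cross-iteration dependency.  Here both expand to nothing, so the loop
     * runs with its serial semantics. */
    #define pipe_while while
    #define pipe_wait(s)
    #define pipe_continue(s)

    int main(void) {
        int frame = 0;
        pipe_while (frame < 8) {
            /* Stage 0: take on the next data element (in iteration order). */
            int data = frame * frame;

            pipe_continue(1);
            /* Stage 1: independent per-element processing; under a pipeline
             * scheduler this stage may overlap across iterations. */
            data += 1;

            /* The stage count is data dependent: only even iterations take
             * the extra, serializing stage. */
            if (frame % 2 == 0) {
                pipe_wait(2);
                /* Stage 2: depends on the previous iteration's stage 2. */
                printf("frame %d -> %d\n", frame, data);
            }
            frame++;
        }
        return 0;
    }

Run serially, this sketch just processes the frames in order; the point of the Piper algorithm described above is that a work-stealing scheduler can overlap such iterations across processors while respecting the cross-iteration dependencies introduced by the waiting stages, throttling the pipeline so it does not run away.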