DIB—a distributed implementation of backtracking
ACM Transactions on Programming Languages and Systems (TOPLAS)
Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Mul-T: a high-performance parallel Lisp
PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Space-Efficient Scheduling of Multithreaded Computations
SIAM Journal on Computing
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
Konrad Zuse's Legacy: The Architecture of the Z1 and Z3
IEEE Annals of the History of Computing
Executing functional programs on a virtual tree of processors
FPCA '81 Proceedings of the 1981 conference on Functional programming languages and computer architecture
The incremental garbage collection of processes
Proceedings of the 1977 symposium on Artificial intelligence and programming languages
Cg: a system for programming graphics hardware in a C-like language
ACM SIGGRAPH 2003 Papers
Spidle: a DSL approach to specifying streaming applications
Proceedings of the 2nd international conference on Generative programming and component engineering
Decoupled Software Pipelining with the Synchronization Array
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Automatic Thread Extraction with Decoupled Software Pipelining
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Aspects of Applicative Programming for Parallel Processing
IEEE Transactions on Computers
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Burroughs' B6500/B7500 stack mechanism
AFIPS '68 (Spring) Proceedings of the April 30--May 2, 1968, spring joint computer conference
Introduction to Algorithms, Third Edition
Introduction to Algorithms, Third Edition
Analytical Modeling of Pipeline Parallelism
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
The Cilk++ concurrency platform
The Journal of Supercomputing
The Cilkview scalability analyzer
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Feedback-directed pipeline parallelism
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Using memory mapping to support cactus stacks in work-stealing runtime systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Parallel programming must be deterministic by default
HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
A stream-computing extension to OpenMP
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Expressing pipeline parallelism using TBB constructs: a case study on what works and what doesn't
Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE!'11, AOOPES'11, NEAT'11, & VMIL'11
Dynamic Fine-Grain Scheduling of Pipeline Parallelism
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Characteristics of workloads using the pipeline programming model
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Overview of the H.264/AVC video coding standard
IEEE Transactions on Circuits and Systems for Video Technology
Structured Parallel Programming: Patterns for Efficient Computation
Structured Parallel Programming: Patterns for Efficient Computation
Well-structured futures and cache locality
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Hi-index | 0.00 |
Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with T1 work and T∞ span (critical-path length), Piper executes the computation on P processors in TP≤ T1/P + O(T∞ + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.