Feedback-directed pipeline parallelism

Authors:
M. Aater Suleman;Moinuddin K. Qureshi; Khubaib;Yale N. Patt
Affiliations:
The University of Texas at Austin, Austin, TX, USA;IBM T. J. Watson Research Center, Yorktown Heights, NY, USA;The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA
Venue:
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Year:
2010

Citing 14
Cited 14

A feedback-driven proportion allocator for real-rate scheduling

OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Using Cohort Scheduling to Enhance Server Performance (Extended Abstract)

OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
SEDA: an architecture for well-conditioned, scalable internet services

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Towards the automatic optimal mapping of pipeline algorithms

Parallel Computing
Performance Evaluation of a Parallel Pipeline Computational Model for Space-Time Adaptive Processing

The Journal of Supercomputing
Compiling for stream processing

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Streamware: programming general-purpose multicore processors using streams

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Parallel-stage decoupled software pipelining

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
The PARSEC benchmark suite: characterization and architectural implications

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Load balancing using work-stealing for pipeline parallelism in emerging applications

Proceedings of the 23rd international conference on Supercomputing
Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Analytical Modeling of Pipeline Parallelism

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques

Parallelism orchestration using DoPE: the degree of parallelism executive

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Parallel pattern detection for architectural improvements

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Petri-nets as an intermediate representation for heterogeneous architectures

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Bottleneck identification and scheduling in multithreaded applications

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Parcae: a system for flexible parallel execution

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
A template library to integrate thread scheduling and locality management for NUMA multiprocessors

HotPar'12 Proceedings of the 4th USENIX conference on Hot Topics in Parallelism
Automatic generation of software pipelines for heterogeneous parallel systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
High-level support for pipeline parallelism on many-core architectures

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Utility-based acceleration of multithreaded applications on asymmetric CMPs

Proceedings of the 40th Annual International Symposium on Computer Architecture
On-the-fly pipeline parallelism

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Adaptive online scheduling in storm

Proceedings of the 7th ACM international conference on Distributed event-based systems
Deterministic scale-free pipeline parallelism with hyperqueues

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Load-balanced pipeline parallelism

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scalable high-performance parallel design for network intrusion detection systems on many-core processors

ANCS '13 Proceedings of the ninth ACM/IEEE symposium on Architectures for networking and communications systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Extracting high performance from Chip Multiprocessors requires that the application be parallelized. A common software technique to parallelize loops is pipeline parallelism in which the programmer/compiler splits each loop iteration into stages and each stage runs on a certain number of cores. It is important to choose the number of cores for each stage carefully because the core-to-stage allocation determines performance and power consumption. Finding the best core-to-stage allocation for an application is challenging because the number of possible allocations is large, and the best allocation depends on the input set and machine configuration. This paper proposes Feedback-Directed Pipelining (FDP), a software framework that chooses the core-to-stage allocation at run-time. FDP first maximizes the performance of the workload and then saves power by reducing the number of active cores, without impacting performance. Our evaluation on a real SMP system with two Core2Quad processors (8 cores) shows that FDP provides an average speedup of 4.2x which is significantly higher than the 2.3x speedup obtained with a practical profile-based allocation. We also show that FDP is robust to changes in machine configuration and input set.