Automatic generation of software pipelines for heterogeneous parallel systems

Authors:
Jacques A. Pienaar; Srimat Chakradhar; Anand Raghunathan
Affiliations:
Purdue University, West Lafayette, IN; NEC Laboratories America, Princeton, NJ; Purdue University, West Lafayette, IN
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012


Abstract

Pipelining is a well-known approach to increasing parallelism and performance. We address the problem of software pipelining for heterogeneous parallel platforms that consist of different multi-core and many-core processing units. In this context, pipelining involves two key steps: partitioning an application into stages, and mapping and scheduling those stages onto the processing units of the heterogeneous platform. We show that the inter-dependency between these steps is a critical challenge that must be addressed in order to achieve high performance. We propose an Automatic Heterogeneous Pipelining framework (AHP) that generates an optimized pipelined implementation of a program from an annotated unpipelined specification. Across three complex applications (image classification, object detection, and document retrieval) and two heterogeneous platforms (Intel Xeon multi-core CPUs with Intel MIC and NVIDIA GPGPU accelerators), AHP achieves a throughput improvement of up to 1.53x (1.37x on average) over a heterogeneous baseline that exploits data and task parallelism.
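
To illustrate why partitioning and mapping depend on each other, the toy sketch below searches both choices jointly for a linear chain of tasks. It is not the framework's algorithm. It is a brute-force illustration under simplifying assumptions: stages are contiguous groups of tasks, each stage runs on its own device, transfer costs between devices are ignored, and steady-state throughput is limited by the slowest stage. The task names, device names, and timings are all hypothetical.

```python
from itertools import combinations, permutations


def stage_time(stage, device, cost):
    """Time for one device to run every task in a stage back to back."""
    return sum(cost[task][device] for task in stage)


def best_pipeline(tasks, devices, cost):
    """Jointly search contiguous partitions and stage-to-device mappings.

    Returns (bottleneck_time, stages, mapping) with the smallest bottleneck,
    i.e. the highest steady-state throughput under the stated assumptions.
    """
    best = None
    n = len(tasks)
    for k in range(1, min(n, len(devices)) + 1):      # number of stages
        for cuts in combinations(range(1, n), k - 1):  # where to split the chain
            bounds = (0, *cuts, n)
            stages = [tasks[a:b] for a, b in zip(bounds, bounds[1:])]
            for mapping in permutations(devices, k):   # one device per stage
                bottleneck = max(
                    stage_time(s, d, cost) for s, d in zip(stages, mapping)
                )
                if best is None or bottleneck < best[0]:
                    best = (bottleneck, stages, mapping)
    return best


# Hypothetical per-item execution times (arbitrary units) on each device.
tasks = ["decode", "extract", "classify"]
devices = ["cpu", "gpu"]
cost = {
    "decode":   {"cpu": 2.0, "gpu": 6.0},
    "extract":  {"cpu": 8.0, "gpu": 2.0},
    "classify": {"cpu": 4.0, "gpu": 1.0},
}

bottleneck, stages, mapping = best_pipeline(tasks, devices, cost)
print(bottleneck, stages, mapping)
```

With these made-up numbers, the cut that is best for one mapping is poor for the other: splitting after "decode" gives a bottleneck of 3.0 with the CPU first and the GPU second, but 12.0 with the devices swapped. Choosing the partition first and the mapping afterwards can therefore miss the best combination.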