Scalable framework for mapping streaming applications onto multi-GPU systems

Authors:
Huynh Phung Huynh;Andrei Hagiescu;Weng-Fai Wong;Rick Siow Mong Goh
Affiliations:
A*STAR Institute of High Performance Computing, Singapore, Singapore;School of Computing, National University of Singapore, Singapore, Singapore;School of Computing, National University of Singapore, Singapore, Singapore;A*STAR Institute of High Performance Computing, Singapore, Singapore
Venue:
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Year:
2012

Abstract

Graphics processing units (GPUs) use a large array of parallel processing cores to accelerate a specific streaming computation pattern frequently found in graphics applications. Many general-purpose applications exhibit the required streaming behavior, but they also have unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper we describe an efficient and scalable code generation framework that maps general-purpose streaming applications onto a multi-GPU system. The framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions a complex application to achieve the best performance under a given shared memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy that balances the workload while accounting for communication overhead. Finally, the partitions are executed on the multi-GPU system in a highly effective pipelined fashion. The framework has been implemented as a back end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and a significant performance speedup over a previous state-of-the-art solution.
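
The mapping step described above (balancing workload while accounting for communication overhead) can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not specify; it is a generic greedy heuristic that stands in for the idea. The function name `map_partitions`, the per-partition `work` estimates, the `comm` volumes between partitions, and the `comm_cost` weight are all hypothetical and introduced only for illustration.

```python
from typing import Dict, List, Tuple


def map_partitions(work: List[float],
                   comm: Dict[Tuple[int, int], float],
                   num_gpus: int,
                   comm_cost: float = 1.0) -> List[int]:
    """Greedily assign partitions to GPUs (illustrative heuristic only).

    work[p]      -- assumed execution-time estimate of partition p
    comm[(a, b)] -- assumed data volume exchanged between partitions a and b
    comm_cost    -- assumed cost per unit of data that crosses GPUs
    Returns a list giving the GPU index chosen for each partition.
    """
    # Build an undirected neighbor list from the communication edges.
    neighbors: Dict[int, List[Tuple[int, float]]] = {
        p: [] for p in range(len(work))
    }
    for (a, b), volume in comm.items():
        neighbors[a].append((b, volume))
        neighbors[b].append((a, volume))

    load = [0.0] * num_gpus      # accumulated compute time per GPU
    assign = [-1] * len(work)    # -1 marks a partition not yet placed

    # Place the heaviest partitions first.
    for p in sorted(range(len(work)), key=lambda i: -work[i]):

        def cost(gpu: int) -> float:
            # Traffic to already-placed neighbors that live on another GPU.
            cross = sum(volume for q, volume in neighbors[p]
                        if assign[q] not in (-1, gpu))
            return load[gpu] + work[p] + comm_cost * cross

        best = min(range(num_gpus), key=cost)
        assign[p] = best
        load[best] += work[p]    # only compute time is tracked as load
    return assign
```

With a low `comm_cost` the heuristic favors even compute load; raising `comm_cost` pushes communicating partitions onto the same GPU, at the expense of balance. A real mapper would also need to model the GPU interconnect and the pipelined schedule, which this sketch ignores.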