Scalable framework for mapping streaming applications onto multi-GPU systems

  • Authors:
  • Huynh Phung Huynh;Andrei Hagiescu;Weng-Fai Wong;Rick Siow Mong Goh

  • Affiliations:
  • A*STAR Institute of High Performance Computing, Singapore, Singapore;School of Computing, National University of Singapore, Singapore, Singapore;School of Computing, National University of Singapore, Singapore, Singapore;A*STAR Institute of High Performance Computing, Singapore, Singapore

  • Venue:
  • Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Graphics processing units leverage on a large array of parallel processing cores to boost the performance of a specific streaming computation pattern frequently found in graphics applications. Unfortunately, while many other general purpose applications do exhibit the required streaming behavior, they also possess unfavorable data layout and poor computation-to-communication ratios that penalize any straight-forward execution on the GPU. In this paper we describe an efficient and scalable code generation framework that can map general purpose streaming applications onto a multi-GPU system. This framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features in our framework ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions the complex application to achieve the best performance under a given shared memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy. The mapping balances the workload while considering the communication overhead. Finally, a highly effective pipeline execution is employed for the execution of the partitions on the multi-GPU system. The framework has been implemented as a back-end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and significant performance speedup compared with a previous state-of-the-art solution.