While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDKs or programming languages that are not always easy to program with. Thus, the impact of the new programming paradigms on programmer productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected to multiple graphics processors. GPUSs deals with architectural heterogeneity and separate memory address spaces while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra show that the runtime adapts correctly to a multi-GPU system and attains notable performance.
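
To make the programming-model idea concrete, the sketch below shows, in plain C, how a StarSs-style annotated task might look for a blocked matrix multiplication. The directive spelling (css task, input/inout, the device clause), the tile size, and the function names are illustrative assumptions, not the exact GPUSs syntax or the operation evaluated in the paper; the point is only that the programmer writes sequential blocked code and annotates task boundaries and data directions, while the runtime builds the dependency graph, schedules tasks on the CPU and GPUs, and moves tiles between the separate address spaces.

#include <stddef.h>

#define BS 256  /* tile (block) size, chosen purely for illustration */

/* In a StarSs-style model, a pragma marks this function as a task and declares
 * data directionality; a device clause would request GPU execution.
 * The directive below is illustrative, not the exact GPUSs syntax. */
#pragma css task input(A, B) inout(C) target device(cuda)
void gemm_tile(const float *A, const float *B, float *C)
{
    /* Reference CPU body: C += A * B on one BS x BS row-major tile. */
    for (size_t i = 0; i < BS; i++)
        for (size_t k = 0; k < BS; k++)
            for (size_t j = 0; j < BS; j++)
                C[i * BS + j] += A[i * BS + k] * B[k * BS + j];
}

/* Blocked matrix multiply over nb x nb tiles (A, B, C hold nb*nb tile
 * pointers). Each gemm_tile call becomes a task; the runtime infers the
 * dependencies from the clauses above, offloads tasks to the available GPUs,
 * and copies tiles between the host and device address spaces as needed. */
void gemm_blocked(size_t nb, float **A, float **B, float **C)
{
    for (size_t i = 0; i < nb; i++)
        for (size_t j = 0; j < nb; j++)
            for (size_t k = 0; k < nb; k++)
                gemm_tile(A[i * nb + k], B[k * nb + j], C[i * nb + j]);
}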