While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDKs or programming languages that are not always easy to program with. Thus, the impact of the new programming paradigms on programmer productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected to multiple graphics processors. GPUSs deals with architectural heterogeneity and separate memory address spaces while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra show that the runtime adapts correctly to a multi-GPU system and attains notable performance.
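
To make the programming-model idea concrete, the sketch below shows, in plain C, how a StarSs-style annotated task might look for a blocked matrix multiplication. The directive spelling (css task, input/inout, the device clause), the tile size, and the function names are illustrative assumptions, not the exact GPUSs syntax or the operation evaluated in the paper; the point is only that the programmer writes sequential blocked code and annotates task boundaries and data directions, while the runtime builds the dependency graph, schedules tasks on the CPU and GPUs, and moves tiles between the separate address spaces.

#include <stddef.h>

#define BS 256  /* tile (block) size, chosen purely for illustration */

/* In a StarSs-style model, a pragma marks this function as a task and declares
 * data directionality; a device clause would request GPU execution.
 * The directive below is illustrative, not the exact GPUSs syntax. */
#pragma css task input(A, B) inout(C) target device(cuda)
void gemm_tile(const float *A, const float *B, float *C)
{
    /* Reference CPU body: C += A * B on one BS x BS row-major tile. */
    for (size_t i = 0; i < BS; i++)
        for (size_t k = 0; k < BS; k++)
            for (size_t j = 0; j < BS; j++)
                C[i * BS + j] += A[i * BS + k] * B[k * BS + j];
}

/* Blocked matrix multiply over nb x nb tiles (A, B, C hold nb*nb tile
 * pointers). Each gemm_tile call becomes a task; the runtime infers the
 * dependencies from the clauses above, offloads tasks to the available GPUs,
 * and copies tiles between the host and device address spaces as needed. */
void gemm_blocked(size_t nb, float **A, float **B, float **C)
{
    for (size_t i = 0; i < nb; i++)
        for (size_t j = 0; j < nb; j++)
            for (size_t k = 0; k < nb; k++)
                gemm_tile(A[i * nb + k], B[k * nb + j], C[i * nb + j]);
}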