Solving dense linear systems on platforms with multiple hardware accelerators

  • Authors:
  • Gregorio Quintana-Ortí; Francisco D. Igual; Enrique S. Quintana-Ortí; Robert A. van de Geijn

  • Affiliations:
  • Universidad Jaime I, Castellón, Spain (Quintana-Ortí, Igual, Quintana-Ortí); The University of Texas at Austin, Austin, TX, USA (van de Geijn)

  • Venue:
  • Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
  • Year:
  • 2009

Abstract

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly, so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques from computer architecture to improve performance. Our experimental evaluation on an Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performance of around 550 and 450 (single-precision) GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.
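To make the abstract's mention of the Cholesky factorization concrete, the sketch below shows a standard tiled right-looking Cholesky in NumPy. Each tile operation (the diagonal factorization, triangular solve, and trailing update, corresponding to the LAPACK/BLAS kernels POTRF, TRSM, and SYRK/GEMM) is the kind of task that a runtime such as SuperMatrix would extract, track dependencies between, and schedule across the accelerators. This is only an illustrative sequential sketch, not the FLAME/SuperMatrix code itself; the function name and tile-size parameter are our own.

```python
import numpy as np

def tiled_cholesky(A, b):
    """Lower-triangular Cholesky factor of SPD matrix A, computed tile by
    tile with tile size b. Each tile operation below maps to one task in a
    task-parallel runtime (illustrative only; not the paper's actual code)."""
    n = A.shape[0]
    for k in range(0, n, b):
        ke = min(k + b, n)
        # POTRF-like task: factor the diagonal tile A_kk = L_kk L_kk^T
        A[k:ke, k:ke] = np.linalg.cholesky(A[k:ke, k:ke])
        Lkk = A[k:ke, k:ke]
        for i in range(ke, n, b):
            ie = min(i + b, n)
            # TRSM-like task: A_ik <- A_ik L_kk^{-T}
            A[i:ie, k:ke] = np.linalg.solve(Lkk, A[i:ie, k:ke].T).T
        for i in range(ke, n, b):
            ie = min(i + b, n)
            for j in range(ke, i + 1, b):
                je = min(j + b, n)
                # SYRK (i == j) / GEMM (i > j) task: trailing-submatrix update
                A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    return np.tril(A)
```

In the runtime setting described in the abstract, the tiles would reside in the separate local memories of the accelerators, and the software cache-coherence layer would decide when a tile must be copied or invalidated before one of these tasks runs.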