This paper presents a dynamic task scheduling approach for executing dense linear algebra algorithms on multicore systems, both shared-memory and distributed-memory. We use a task-based library to replace existing linear algebra subroutines such as PBLAS, transparently providing the same interface and computational functionality as the ScaLAPACK library. Linear algebra programs are written with the task-based library and executed by a dynamic runtime system. Our runtime system design focuses primarily on performance scalability. We propose a distributed algorithm that resolves data dependencies without inter-process cooperation. We have implemented the runtime system and applied it to three linear algebra algorithms: Cholesky, LU, and QR factorizations. Experiments on both shared-memory machines (16 and 32 cores) and distributed-memory machines (1024 cores) demonstrate that our runtime system achieves good scalability. Furthermore, we provide an analytical model that explains why the tiled algorithms are scalable and predicts their expected execution time.
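The dependency-driven execution the abstract describes can be sketched as follows. This is an illustrative dependency-counting scheduler for a tiled Cholesky factorization, not the paper's actual implementation: the tiles are 1×1 scalars so the kernels stay trivial while the task graph keeps the shape of the blocked algorithm, and all function and task names are hypothetical.

```python
import math
import threading
from concurrent.futures import ThreadPoolExecutor

def tiled_cholesky(A, workers=4):
    """Lower Cholesky factor of a symmetric positive-definite matrix,
    computed as a task DAG executed by a dependency-counting scheduler.
    Tiles are 1x1, so kernels are scalar ops but the DAG matches the
    blocked algorithm (POTRF / TRSM / trailing-matrix updates)."""
    n = len(A)
    A = [row[:] for row in A]          # work on a copy
    tasks = {}                         # task id -> (kernel, successor ids)
    counts = {}                        # task id -> unfinished-predecessor count

    def add(tid, kernel, deps):
        tasks[tid] = (kernel, [])
        counts[tid] = len(deps)
        for d in deps:
            tasks[d][1].append(tid)    # register tid as a successor of d

    # Generate the tiled-Cholesky task graph.
    for k in range(n):
        def potrf(k=k):
            A[k][k] = math.sqrt(A[k][k])
        add(("potrf", k), potrf, [("upd", k, k, k - 1)] if k else [])
        for i in range(k + 1, n):
            def trsm(i=i, k=k):
                A[i][k] /= A[k][k]
            add(("trsm", i, k), trsm,
                [("potrf", k)] + ([("upd", i, k, k - 1)] if k else []))
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                def update(i=i, j=j, k=k):
                    A[i][j] -= A[i][k] * A[j][k]
                # When i == j the TRSM dependency is listed twice; it is then
                # both counted and decremented twice, which stays consistent.
                add(("upd", i, j, k), update,
                    [("trsm", i, k), ("trsm", j, k)]
                    + ([("upd", i, j, k - 1)] if k else []))

    lock = threading.Lock()
    done = threading.Event()
    remaining = [len(tasks)]

    def run(tid):
        kernel, succs = tasks[tid]
        kernel()
        ready = []
        with lock:                     # counters are the only shared state
            remaining[0] -= 1
            if remaining[0] == 0:
                done.set()
            for s in succs:
                counts[s] -= 1
                if counts[s] == 0:
                    ready.append(s)
        for s in ready:
            pool.submit(run, s)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for tid, c in list(counts.items()):
            if c == 0:                 # tasks with no predecessors start first
                pool.submit(run, tid)
        done.wait()

    for i in range(n):                 # zero the unused upper triangle
        for j in range(i + 1, n):
            A[i][j] = 0.0
    return A
```

Each task becomes ready the moment its own predecessor counter reaches zero, so no worker ever has to consult a central list of pending tasks; this mirrors, at a high level, how the paper's distributed runtime lets each process decide task readiness without process cooperation.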