This paper presents a dynamic task scheduling approach for executing dense linear algebra algorithms on multicore systems, both shared-memory and distributed-memory. We use a task-based library to replace existing linear algebra subroutines such as PBLAS, transparently providing the same interface and computational functionality as the ScaLAPACK library. Linear algebra programs are written with the task-based library and executed by a dynamic runtime system. Our runtime system design focuses primarily on performance scalability. We propose a distributed algorithm that resolves data dependencies without inter-process cooperation. We have implemented the runtime system and applied it to three linear algebra algorithms: Cholesky, LU, and QR factorizations. Experiments on both shared-memory machines (16 and 32 cores) and distributed-memory machines (1024 cores) demonstrate that our runtime system achieves good scalability. Furthermore, we provide an analytical model that explains why the tiled algorithms are scalable and predicts their expected execution time.
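The dependency-driven execution the abstract describes can be sketched as follows. This is an illustrative dependency-counting scheduler for a tiled Cholesky factorization, not the paper's actual implementation: the tiles are 1×1 scalars so the kernels stay trivial while the task graph keeps the shape of the blocked algorithm, and all function and task names are hypothetical.

```python
import math
import threading
from concurrent.futures import ThreadPoolExecutor

def tiled_cholesky(A, workers=4):
    """Lower Cholesky factor of a symmetric positive-definite matrix,
    computed as a task DAG executed by a dependency-counting scheduler.
    Tiles are 1x1, so kernels are scalar ops but the DAG matches the
    blocked algorithm (POTRF / TRSM / trailing-matrix updates)."""
    n = len(A)
    A = [row[:] for row in A]          # work on a copy
    tasks = {}                         # task id -> (kernel, successor ids)
    counts = {}                        # task id -> unfinished-predecessor count

    def add(tid, kernel, deps):
        tasks[tid] = (kernel, [])
        counts[tid] = len(deps)
        for d in deps:
            tasks[d][1].append(tid)    # register tid as a successor of d

    # Generate the tiled-Cholesky task graph.
    for k in range(n):
        def potrf(k=k):
            A[k][k] = math.sqrt(A[k][k])
        add(("potrf", k), potrf, [("upd", k, k, k - 1)] if k else [])
        for i in range(k + 1, n):
            def trsm(i=i, k=k):
                A[i][k] /= A[k][k]
            add(("trsm", i, k), trsm,
                [("potrf", k)] + ([("upd", i, k, k - 1)] if k else []))
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                def update(i=i, j=j, k=k):
                    A[i][j] -= A[i][k] * A[j][k]
                # When i == j the TRSM dependency is listed twice; it is then
                # both counted and decremented twice, which stays consistent.
                add(("upd", i, j, k), update,
                    [("trsm", i, k), ("trsm", j, k)]
                    + ([("upd", i, j, k - 1)] if k else []))

    lock = threading.Lock()
    done = threading.Event()
    remaining = [len(tasks)]

    def run(tid):
        kernel, succs = tasks[tid]
        kernel()
        ready = []
        with lock:                     # counters are the only shared state
            remaining[0] -= 1
            if remaining[0] == 0:
                done.set()
            for s in succs:
                counts[s] -= 1
                if counts[s] == 0:
                    ready.append(s)
        for s in ready:
            pool.submit(run, s)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for tid, c in list(counts.items()):
            if c == 0:                 # tasks with no predecessors start first
                pool.submit(run, tid)
        done.wait()

    for i in range(n):                 # zero the unused upper triangle
        for j in range(i + 1, n):
            A[i][j] = 0.0
    return A
```

Each task becomes ready the moment its own predecessor counter reaches zero, so no worker ever has to consult a central list of pending tasks; this mirrors, at a high level, how the paper's distributed runtime lets each process decide task readiness without process cooperation.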