Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem has non-trivial dependence patterns that limit parallelism, and the local computations require large matrices to achieve good single-processor performance. We present an alternative programming model for this class of problem that combines UPC's global address space with lightweight multithreading. We introduce the concept of memory-constrained lookahead, in which the amount of concurrency managed by each processor is controlled by the amount of memory available. We implement novel techniques for steering the computation to optimize for high performance, and we demonstrate the scalability and portability of UPC with Teraflop-level performance on some machines, comparing favourably with other state-of-the-art MPI codes.
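
To make the lookahead idea concrete, here is a minimal C sketch (not the paper's implementation) of the scheduling decision it describes: rather than fixing the lookahead depth, the number of panel factorizations allowed to run ahead of the trailing-matrix updates is bounded by a per-process memory budget. All names and constants below (N_PANELS, PANEL_BYTES, mem_budget) are illustrative assumptions.

/*
 * Hedged sketch of memory-constrained lookahead: panel k+1, k+2, ...
 * may start before the trailing update of step k retires, but each
 * in-flight panel holds a buffer, so the lookahead depth is gated by
 * a memory budget rather than a fixed constant.
 */
#include <stdio.h>
#include <stddef.h>

#define N_PANELS    16          /* panels in the factorization          */
#define PANEL_BYTES (1u << 20)  /* memory held by one in-flight panel   */

int main(void) {
    size_t mem_budget = 4 * (size_t)PANEL_BYTES; /* per-process budget  */
    size_t mem_in_use = 0;
    int next_panel  = 0;  /* next panel factorization allowed to start  */
    int next_update = 0;  /* next trailing update to retire              */

    while (next_update < N_PANELS) {
        /* Run ahead: start panels while the memory budget allows it.  */
        while (next_panel < N_PANELS &&
               mem_in_use + PANEL_BYTES <= mem_budget) {
            mem_in_use += PANEL_BYTES;
            printf("start panel %d (in use: %zu MiB)\n",
                   next_panel, mem_in_use >> 20);
            next_panel++;
        }
        /* Retire the oldest trailing update, freeing its panel buffer. */
        printf("apply trailing update %d\n", next_update);
        mem_in_use -= PANEL_BYTES;
        next_update++;
    }
    return 0;
}

In the actual UPC setting the panels would live in shared arrays and the retire step would be a trailing-matrix update, but the gating logic is the point: the lookahead depth adapts automatically to whatever memory is actually available on each processor.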