Space and time efficient execution of parallel irregular computations

Authors:
Cong Fu;Tao Yang
Affiliations:
Department of Computer Science, University of California, Santa Barbara, CA;Department of Computer Science, University of California, Santa Barbara, CA
Venue:
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
1997

Citing 16
Cited 5

Run-time scheduling and execution of loops on message passing machines

Journal of Parallel and Distributed Computing - Special issue: algorithms for hypercube computers
Program partitioning for NUMA multiprocessor computer systems

Journal of Parallel and Distributed Computing - Special issue on performance of supercomputers
List scheduling with and without communication delays

Parallel Computing
Cilk: an efficient multithreaded runtime system

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Provably efficient scheduling for languages with fine-grained parallelism

Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Modeling the benefits of mixed data and task parallelism

Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Decoupling synchronization and data transfer in message passing systems of parallel computers

ICS '95 Proceedings of the 9th international conference on Supercomputing
Run-time compilation for parallel sparse matrix computations

ICS '96 Proceedings of the 10th international conference on Supercomputing
Run-time techniques for exploiting irregular task parallelism on distributed memory architectures

Journal of Parallel and Distributed Computing
Sparse LU factorization with partial pivoting on distributed memory machines

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Partitioning and Scheduling Parallel Programs for Multiprocessors

Partitioning and Scheduling Parallel Programs for Multiprocessors
Parallel Programming and Compilers

Parallel Programming and Compilers
Improved load distribution in parallel sparse Cholesky factorization

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Automatic Extraction of Functional Parallelism from Ordinary Programs

IEEE Transactions on Parallel and Distributed Systems
DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors

IEEE Transactions on Parallel and Distributed Systems
Sparse Gaussian elimination on high-performance computers

Sparse Gaussian elimination on high-performance computers

Efficient Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures

IEEE Transactions on Parallel and Distributed Systems
Elimination forest guided 2D sparse LU factorization

Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
The design, implementation, and evaluation of Jade

ACM Transactions on Programming Languages and Systems (TOPLAS)
Compile/run-time support for threaded MPI execution on multiprogrammed shared memory machines

Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Low Memory Cost Dynamic Scheduling of Large Coarse Grain Task Graphs

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium

Abstract

Solving large problems is an important goal for parallel machines with multiple CPU and memory resources. This paper addresses the efficient execution of overhead-sensitive parallel irregular computations under memory constraints. Irregular parallelism is modeled by task dependence graphs with mixed granularities, and the trade-off between time efficiency and space efficiency is investigated. The main difficulty in designing efficient run-time system support is caused by the use of fast communication primitives available on modern parallel architectures. A run-time active memory management scheme and new scheduling techniques are proposed to improve memory utilization while retaining good time efficiency, and a theoretical analysis of correctness and performance is provided. This work is implemented in the context of the RAPID system [5], which provides run-time support for parallelizing irregular code on distributed memory machines, and the effectiveness of the proposed techniques is verified on sparse Cholesky factorization and sparse LU factorization with partial pivoting. Experimental results on a Cray T3D show that solvable problem sizes can be increased substantially under limited memory capacities and that the loss of execution efficiency caused by the extra memory management overhead is reasonable.
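
The time/space trade-off mentioned in the abstract can be illustrated with a toy model. The sketch below is not the RAPID memory management scheme or the paper's scheduling algorithm; it is a hypothetical single-processor example in which each task allocates an output buffer that is released once every dependent task has finished. The function name `peak_memory`, the task names, and the buffer sizes are all assumptions made for illustration. It shows that two valid execution orders of the same task dependence graph can differ in peak memory.

```python
from collections import defaultdict

def peak_memory(order, size, deps):
    """Peak live memory when tasks run in `order` on one processor.

    Toy model (an assumption, not the paper's scheme): each task allocates
    size[t] units for its output, and that output is freed as soon as the
    last task depending on it has finished.
    """
    # Count how many consumers each task's output still has.
    consumers_left = defaultdict(int)
    for task, parents in deps.items():
        for p in parents:
            consumers_left[p] += 1

    live = peak = 0
    for task in order:
        live += size[task]              # allocate the task's output
        peak = max(peak, live)
        for p in deps.get(task, ()):    # this task has consumed its inputs
            consumers_left[p] -= 1
            if consumers_left[p] == 0:
                live -= size[p]         # last consumer done: free the input
        if consumers_left[task] == 0:   # nobody needs this output: free it
            live -= size[task]
    return peak

# Two independent chains, a1 -> a2 and b1 -> b2, each buffer 4 units.
size = {"a1": 4, "a2": 4, "b1": 4, "b2": 4}
deps = {"a2": ["a1"], "b2": ["b1"]}

wide = peak_memory(["a1", "b1", "a2", "b2"], size, deps)  # start both chains early
deep = peak_memory(["a1", "a2", "b1", "b2"], size, deps)  # finish one chain first
print(wide, deep)
```

In this toy graph the order that starts both chains early keeps more buffers alive at once than the order that completes one chain before starting the other. An order that exposes more ready work is generally the kind a time-oriented scheduler prefers on multiple processors, which is the tension a memory-aware scheduler has to balance.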