Pthreads for dynamic and irregular parallelism

Authors:
Girija J. Narlikar; Guy E. Blelloch
Affiliations:
CMU School of Computer Science, Pittsburgh, PA (both authors)
Venue:
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Year:
1998



Abstract

High-performance applications on shared-memory machines have typically been written in a coarse-grained style, with one heavyweight thread per processor. In comparison, programming with a large number of lightweight, parallel threads has several advantages, including simpler coding for programs with irregular and dynamic parallelism, and better adaptability to a changing number of processors. The programmer can express each individual parallel task as a new thread; the implementation dynamically creates and schedules these threads onto the processors and balances the load effectively. However, unless the thread scheduler is designed carefully, the parallel program may suffer poor space and time performance.

In this paper, we study the performance of a native, lightweight POSIX threads (Pthreads) library on a shared-memory machine running Solaris; to our knowledge, the Solaris library is one of the most efficient user-level implementations of the Pthreads standard available today. To evaluate this Pthreads implementation, we use a set of parallel programs that dynamically create a large number of threads. The programs include dense and sparse matrix multiplies, two N-body codes, a data classifier, a volume rendering benchmark, and a high-performance FFT package. We find the existing thread scheduler to be unsuitable for executing such programs. We show how simple modifications to the Pthreads scheduler can significantly improve space and time performance for these programs; the modified scheduler reduces running time by as much as 44% and memory requirement by as much as 63% compared to the original Pthreads implementation. Our results indicate that, provided we use a good scheduler, the rich functionality and standard API of Pthreads can be combined with the advantages of dynamic, lightweight threads to achieve high performance.
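
As an illustration of the fine-grained style described above, the following is a minimal sketch in C that forks a new Pthread for each parallel subtask of a divide-and-conquer array sum. It is not taken from the paper's benchmarks and does not show the scheduler modifications; the names `sum_task`, `sum_range`, `parallel_sum`, and the `cutoff` parameter are hypothetical and chosen only for this example. Only the standard `pthread_create` and `pthread_join` calls are assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical task descriptor: sum a[lo..hi) and store it in result. */
typedef struct {
    const long *a;
    size_t lo, hi;
    size_t cutoff;   /* ranges of at most this size are summed serially */
    long result;
} sum_task;

/* Thread entry point: split the range, fork a thread for the left half,
 * handle the right half in the current thread, then join. */
static void *sum_range(void *arg) {
    sum_task *t = arg;
    if (t->hi - t->lo <= t->cutoff) {
        long s = 0;
        for (size_t i = t->lo; i < t->hi; i++) s += t->a[i];
        t->result = s;
        return NULL;
    }
    size_t mid = t->lo + (t->hi - t->lo) / 2;
    sum_task left  = { t->a, t->lo, mid,   t->cutoff, 0 };
    sum_task right = { t->a, mid,   t->hi, t->cutoff, 0 };

    pthread_t child;
    /* One new thread per parallel task; the library decides where it runs. */
    int forked = (pthread_create(&child, NULL, sum_range, &left) == 0);
    if (!forked) sum_range(&left);   /* fall back to serial execution */
    sum_range(&right);
    if (forked) pthread_join(child, NULL);

    t->result = left.result + right.result;
    return NULL;
}

/* Sum the n elements of a, creating threads dynamically as the recursion unfolds. */
long parallel_sum(const long *a, size_t n, size_t cutoff) {
    assert(cutoff > 0);              /* guarantees the recursion terminates */
    sum_task root = { a, 0, n, cutoff, 0 };
    sum_range(&root);
    return root.result;
}
```

A call such as `parallel_sum(data, n, 64)` creates roughly `n / 64` threads regardless of the number of processors. The program states only which tasks may run in parallel, so its space and time behavior depends on how the thread library schedules those threads.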