MULTILISP: a language for concurrent symbolic computation
ACM Transactions on Programming Languages and Systems (TOPLAS)
Control of parallelism in the Manchester Dataflow Machine
Proc. of a conference on Functional programming languages and computer architecture
PRESTO: a system for object-oriented parallel programming
Software—Practice & Experience
Resource requirements of dataflow programs
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Workcrews: an abstraction for controlling parallelism
International Journal of Parallel Programming
Process control and scheduling issues for multiprogrammed shared-memory multiprocessors
SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
A simple load balancing scheme for task allocation in parallel machines
SPAA '91 Proceedings of the third annual ACM symposium on Parallel algorithms and architectures
Scheduler activations: effective kernel support for the user-level management of parallelism
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
A customizable substrate for concurrent languages
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
C4.5: programs for machine learning
C4.5: programs for machine learning
Data locality and load balancing in COOL
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Journal of Parallel and Distributed Computing
Provably efficient scheduling for languages with fine-grained parallelism
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Whole-program optimization for time and space efficient threads
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Thread scheduling for cache locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Cilk: an efficient multithreaded runtime system
Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
The performance implications of locality information usage in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
Space-efficient implementation of nested parallelism
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Parallel breadth-first BDD construction
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
First-class user-level threads
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Parallel hierarchical molecular structure estimation
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Earthquake ground motion modeling on parallel computers
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Parallel data mining for association rules on shared-memory multi-processors
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Molecular dynamics simulation of large-scale carbon nanotubes on a shared-memory architecture
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
1998 IEEE Information Technology Conference
1998 IEEE Information Technology Conference
Storage Management in Virtual Tree Machines
IEEE Transactions on Computers
Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
Machine Learning
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Cid: A Parallel, "Shared-Memory" C for Distributed-Memory Machines
LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Piecewise Execution of Nested Data-Parallel Programs
LCPC '95 Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing
The Fastest Fourier Transform in the West
The Fastest Fourier Transform in the West
The Performance of Work Stealing in Multiprogrammed Environments
The Performance of Work Stealing in Multiprogrammed Environments
Scheduling threads for low space requirement and good locality
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Space-efficient scheduling of nested parallelism
ACM Transactions on Programming Languages and Systems (TOPLAS)
Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Multithreaded parallelism with OpenMP
High performance scientific and engineering computing
Hardware-modulated parallelism in chip multiprocessors
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Irregular computations in Fortran - expression and implementation strategies
Scientific Programming
A pilot study to compare programming effort for two parallel programming models
Journal of Systems and Software
Identifying the optimal energy-efficient operating points of parallel workloads
Proceedings of the International Conference on Computer-Aided Design
Hi-index | 0.00 |
High performance applications on shared memory machines have typically been written in a coarse grained style, with one heavyweight thread per processor. In comparison, programming with a large number of lightweight, parallel threads has several advantages, including simpler coding for programs with irregular and dynamic parallelism, and better adaptability to a changing number of processors. The programmer can express a new thread to execute each individual parallel task; the implementation dynamically creates and schedules these threads onto the processors, and effectively balances the load. However, unless the threads scheduler is designed carefully, the parallel program may suffer poor space and time performance.In this paper, we study the performance of a native, lightweight POSIX threads (Pthreads) library on a shared memory machine running Solaris; to our knowledge, the Solaris library is one of the most efficient user-level implementations of the Pthreads standard available today. To evaluate this Pthreads implementation, we use a set of parallel programs that dynamically create a large number of threads. The programs include dense and sparse matrix multiplies, two N-body codes, a data classifier, a volume rendering benchmark, and a high performance FFT package. We find the existing threads scheduler to be unsuitable for executing such programs. We show how simple modifications to the Pthreads scheduler can result in significantly improved space and time performance for the programs; the modified scheduler results in as much as 44% less running time and 63% less memory requirement compared to the original Pthreads implementation. Our results indicate that, provided we use a good scheduler, the rich functionality and standard API of Pthreads can be combined with the advantages of dynamic, lightweight threads to result in high performance.