Run-time scheduling and execution of loops on message passing machines
Journal of Parallel and Distributed Computing - Special issue: algorithms for hypercube computers
Execution time support for adaptive scientific algorithms on distributed
Concurrency: Practice and Experience
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Runtime compilation techniques for data partitioning and communication schedule reuse
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
An efficient hybrid dataflow architecture model
Journal of Parallel and Distributed Computing
Parallelizing molecular dynamics programs for distributed memory machines: an application of the CHAOS runtime support library
Software—Practice & Experience
Interprocedural compilation of irregular applications for distributed memory machines
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
A design study of the EARTH multiprocessor
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Polling watchdog: combining polling and interrupts for efficient message handling
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
A study of the EARTH-MANNA multithreaded system
International Journal of Parallel Programming - Special issue on parallel architectures and compilation techniques—part II
A framework for sparse matrix code synthesis from high-level specifications
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Numerical Linear Algebra for High Performance Computers
Numerical Linear Algebra for High Performance Computers
Distributed Memory Compiler Design For Sparse Problems
IEEE Transactions on Computers
Compiling Global Name-Space Parallel Loops for Distributed Execution
IEEE Transactions on Parallel and Distributed Systems
Latency Hiding in Message-Passing Architectures
Proceedings of the 8th International Symposium on Parallel Processing
Building Multithreaded Architectures with Off-the-Shelf Microprocessors
Proceedings of the 8th International Symposium on Parallel Processing
Next Generation System Software for Future High-End Computing Systems
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Performance Portability on EARTH: A Case Study across Several Parallel Architectures
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15 - Volume 16
Hi-index | 0.00 |
We report on our work in developing a fine-grained multithreaded solution for the communication-intensive Conjugate Gradient (CG) problem. In our recent work, we have developed a simple, yet very efficient, solution to executing matrix-vector multiply on a multithreaded system. This paper presents an effective mechanism for the reduction-broadcast phase, which is implemented and integrated with the sparse MVM resulting in a scalable implementation of the complete CG application. Three major observations from our experiments on the EARTH multithreaded testbed are: (1) The scalability of our CG implementation is impressive, e.g., speedup is 90 on 120 processors for the NAS CG class B input. (2) Our dataflow-style reduction-broadcast network based on fine-grain multithreading is twice as fast as a serial reduction scheme on the same system. (3)By slowing down the netwok by a factor of 2, no notable degradation of overall CG performance was observed.