Landing CG on EARTH: a case study of fine-grained multithreading on an evolutionary path

Authors:
Kevin B. Theobald;Gagan Agrawal;Rishi Kumar;Gerd Heber;Guang R. Gao;Paul Stodghill;Keshav Pingali
Affiliations:
Department of Electrical and Computer Engineering, University of Delaware;Department of Computer and Information Sciences, University of Delaware;Department of Electrical and Computer Engineering, University of Delaware;Cornell Theory Center, Cornell University;Department of Electrical and Computer Engineering, University of Delaware;Department of Computer Science, Cornell University;Department of Computer Science, Cornell University
Venue:
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Year:
2000

Abstract

We report on our work in developing a fine-grained multithreaded solution for the communication-intensive Conjugate Gradient (CG) problem. In our recent work, we developed a simple yet very efficient solution for executing matrix-vector multiply (MVM) on a multithreaded system. This paper presents an effective mechanism for the reduction-broadcast phase, which is implemented and integrated with the sparse MVM, resulting in a scalable implementation of the complete CG application. Three major observations from our experiments on the EARTH multithreaded testbed are: (1) Our CG implementation scales well, e.g., the speedup is 90 on 120 processors for the NAS CG class B input. (2) Our dataflow-style reduction-broadcast network based on fine-grain multithreading is twice as fast as a serial reduction scheme on the same system. (3) Slowing down the network by a factor of 2 caused no notable degradation of overall CG performance.
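
To make the structure described in the abstract concrete, here is a minimal sequential sketch in Python of where the MVM and the two reduction-broadcast steps sit in a textbook CG iteration. It is not the authors' EARTH implementation: the names `tree_reduce`, `dot`, `matvec`, `cg`, and the `nproc` parameter are hypothetical, the matrix is dense rather than sparse, and the "processors" are simulated by slicing the vectors into chunks. The pairwise `tree_reduce` only illustrates why a log-depth reduction needs fewer sequential steps than a serial one.

```python
from math import sqrt

def tree_reduce(partials):
    """Pairwise sum of a non-empty list; returns (total, number of tree levels)."""
    vals = list(partials)
    levels = 0
    while len(vals) > 1:
        # Combine neighbours; on a parallel machine each level's adds could run concurrently.
        vals = [sum(vals[i:i + 2]) for i in range(0, len(vals), 2)]
        levels += 1
    return vals[0], levels

def dot(x, y, nproc):
    """Dot product: one partial sum per simulated processor, then a tree reduction."""
    chunk = max(1, -(-len(x) // nproc))  # ceiling division
    partials = [sum(a * b for a, b in zip(x[s:s + chunk], y[s:s + chunk]))
                for s in range(0, len(x), chunk)]
    # In a distributed setting the total would now be broadcast to every processor.
    return tree_reduce(partials)[0]

def matvec(A, p):
    """Dense matrix-vector multiply (the paper uses a sparse MVM)."""
    return [sum(a * v for a, v in zip(row, p)) for row in A]

def cg(A, b, nproc=4, tol=1e-10, max_iter=1000):
    """Textbook conjugate gradient for a symmetric positive definite A."""
    x = [0.0] * len(b)
    r = [float(v) for v in b]
    p = r[:]
    rs = dot(r, r, nproc)
    for _ in range(max_iter):
        if sqrt(rs) < tol:
            break
        Ap = matvec(A, p)                  # MVM phase
        alpha = rs / dot(p, Ap, nproc)     # reduction-broadcast 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r, nproc)          # reduction-broadcast 2
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

In this sketch, reducing 120 partial values takes 7 tree levels, whereas a serial chain would need 119 dependent additions. That is only the general motivation for tree-shaped reductions, not a reproduction of the measurements reported above.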