Tolerating communication latency through dynamic thread invocation in a multithreaded architecture

Authors:
Andrew Sohn;Yuetsu Kodama;Jui-Yuan Ku;Mitsuhisa Sato;Yoshinori Yamaguchi
Venue:
Compiler optimizations for scalable parallel systems
Year:
2001

Abstract

Communication latency is a key parameter affecting the performance of distributed-memory multiprocessors. Instruction-level multithreading attempts to tolerate this latency by overlapping communication with computation. This chapter examines the multithreading capabilities of the EM-X distributed-memory multiprocessor through empirical studies. The EM-X provides hardware support for dynamic function spawning and instruction-level multithreading, including a bypassing mechanism for direct remote reads and writes, hardware FIFO thread scheduling, and dedicated instructions for generating fixed-size communication packets based on one-sided communication. Two problems, bitonic sorting and the Fast Fourier Transform (FFT), are selected for the experiments. The parameters that characterize multithreading performance are investigated, including the number of threads, the number of thread switches, the run length, and the number of remote reads. Experimental results indicate that the best communication performance occurs with two to four threads. Using more than eight threads is found to be inefficient and adversely affects overall performance. FFT achieves over 95% overlap, owing to the large amount of computation and communication parallelism across threads. Even in the absence of thread computation parallelism, multithreading helps overlap over 35% of the communication time for bitonic sorting.
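
The interplay between the number of threads, the run length, and the remote-read latency can be illustrated with a toy model. The sketch below is not the EM-X hardware or the chapter's measurement method. It assumes a single processor, a FIFO ready queue, a fixed remote-read latency, and free thread switches. The names `simulate` and `hidden_fraction`, and the cycle counts in the usage lines, are hypothetical. Each thread repeatedly computes for `run_length` cycles, issues one remote read, and is suspended until the reply arrives. The "hidden fraction" is defined here, as an assumption, as the share of total read latency that does not show up as processor stall time.

```python
from collections import deque
import heapq


def simulate(num_threads, reads_per_thread, run_length, latency):
    """Toy model: return (total_cycles, idle_cycles) for one processor.

    Assumptions (not taken from the chapter): zero-cost thread switches,
    a constant remote-read latency, and one remote read per compute burst.
    """
    # FIFO queue of runnable threads: (thread id, remote reads left to issue)
    ready = deque((tid, reads_per_thread) for tid in range(num_threads))
    pending = []  # min-heap of (reply arrival time, thread id, reads left)
    clock = 0
    idle = 0
    while ready or pending:
        # Threads whose replies have arrived rejoin the FIFO queue,
        # unless they have no work left.
        while pending and pending[0][0] <= clock:
            _, tid, remaining = heapq.heappop(pending)
            if remaining > 0:
                ready.append((tid, remaining))
        if not ready:
            if not pending:
                break  # every thread has finished
            # Nothing runnable: the processor stalls until the next reply.
            wake = pending[0][0]
            idle += wake - clock
            clock = wake
            continue
        tid, remaining = ready.popleft()
        clock += run_length  # compute burst
        # Issue the remote read and suspend this thread until the reply.
        heapq.heappush(pending, (clock + latency, tid, remaining - 1))
    return clock, idle


def hidden_fraction(num_threads, reads_per_thread, run_length, latency):
    """Share of the summed read latency that is not exposed as stall time."""
    _, idle = simulate(num_threads, reads_per_thread, run_length, latency)
    total_latency = num_threads * reads_per_thread * latency
    return 1.0 - idle / total_latency


if __name__ == "__main__":
    # Hypothetical parameters: 10-cycle run length, 40-cycle read latency.
    for n in (1, 2, 4, 8):
        total, idle = simulate(n, 16, 10, 40)
        print(n, total, idle, round(hidden_fraction(n, 16, 10, 40), 3))
```

With one thread, every read stalls the processor for its full latency, so nothing is hidden. Adding threads fills those stalls with other threads' compute bursts. Because switches are free in this model, it cannot reproduce the degradation the abstract reports beyond eight threads. Capturing that effect would require modeling switch overhead and network contention, which are left out here.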