Communication latency is a key parameter affecting the performance of distributed-memory multiprocessors. Instruction-level multithreading attempts to tolerate this latency by overlapping communication with computation. This chapter examines the multithreading capabilities of the EM-X distributed-memory multiprocessor through empirical studies. The EM-X provides hardware support for dynamic function spawning and instruction-level multithreading. This support includes a bypassing mechanism for direct remote reads and writes, hardware FIFO thread scheduling, and dedicated instructions for generating fixed-size communication packets based on one-sided communication. Two problems, bitonic sorting and the Fast Fourier Transform (FFT), are selected for the experiments. Parameters that characterize multithreading performance are investigated, including the number of threads, the number of thread switches, the run length, and the number of remote reads. Experimental results indicate that the best communication performance occurs with two to four threads. Using more than eight threads is found to be inefficient and adversely affects overall performance. FFT achieves over 95% overlap because of the large amount of computation and communication parallelism across threads. Even in the absence of thread computation parallelism, multithreading helps overlap over 35% of the communication time for bitonic sorting.
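To make the remote-read parameter concrete, the following is a minimal sequential sketch of a bitonic sorting network in Python. It is not the EM-X implementation. The function name `bitonic_sort`, the `num_procs` parameter, and the block distribution of elements over processors are assumptions introduced here for illustration. The sketch counts how many compare-exchange steps pair elements that would live on different processors, since each such step would require a remote read on a distributed-memory machine.

```python
def bitonic_sort(values, num_procs=1):
    """Sort values in ascending order with a bitonic sorting network.

    Returns (sorted_list, remote_pairs). remote_pairs counts the
    compare-exchange steps whose two elements fall in different blocks
    under an assumed block distribution over num_procs hypothetical
    processors. This is an illustrative model, not measured EM-X data.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if num_procs < 1 or n % num_procs:
        raise ValueError("num_procs must divide the input length")

    a = list(values)
    block = n // num_procs          # elements held by each processor
    remote_pairs = 0

    k = 2                           # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2                  # distance between compared elements
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:     # handle each pair exactly once
                    if i // block != partner // block:
                        remote_pairs += 1
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a, remote_pairs


if __name__ == "__main__":
    data = [3, 7, 4, 8, 6, 2, 1, 5]
    for procs in (1, 2, 4):
        result, remote = bitonic_sort(data, procs)
        print(procs, result, remote)
```

In this model, each compare-exchange does very little work per remote access, which is consistent with the abstract's observation that bitonic sorting offers little thread computation parallelism to hide communication behind.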