Tolerating communication latency through dynamic thread invocation in a multithreaded architecture

  • Authors:
  • Andrew Sohn;Yuetsu Kodama;Jui-Yuan Ku;Mitsuhisa Sato;Yoshinori Yamaguchi

  • Venue:
  • Compiler optimizations for scalable parallel systems
  • Year:
  • 2001

Abstract

Communication latency is a key parameter that affects the performance of distributed-memory multiprocessors. Instruction-level multithreading attempts to tolerate latency by overlapping communication with computation. This chapter explicates the multithreading capabilities of the EM-X distributed-memory multiprocessor through empirical studies. The EM-X provides hardware support for dynamic function spawning and instruction-level multithreading. This support includes a by-passing mechanism for direct remote reads and writes, hardware FIFO thread scheduling, and dedicated instructions for generating fixed-size communication packets based on one-sided communication. Two problems, bitonic sorting and the Fast Fourier Transform (FFT), are selected for the experiments. Parameters that characterize the performance of multithreading are investigated, including the number of threads, the number of thread switches, the run length, and the number of remote reads. Experimental results indicate that the best communication performance occurs when the number of threads is two to four. Using more than eight threads is found to be inefficient and adversely affects overall performance. FFT achieved over 95% overlap of communication with computation, owing to the large amount of computation and communication parallelism across threads. Even in the absence of thread computation parallelism, multithreading helps overlap over 35% of the communication time for bitonic sorting.
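
The parameters named in the abstract (number of threads, run length, number of remote reads) can be related in a toy model. The following Python sketch is an illustration only and is not taken from the chapter: it assumes a single processor, a FIFO ready queue, a switch on every remote read, a fixed read latency, and no network contention. The function names `simulate` and `overlap_fraction`, and all cycle counts, are hypothetical. It does not reproduce the EM-X measurements; in particular it omits the overheads that make large thread counts inefficient in the reported experiments.

```python
from collections import deque
import heapq


def simulate(num_threads, reads_total, run_length, latency, switch_cost=0):
    """Toy switch-on-remote-read model (an assumption, not the EM-X design).

    Each thread repeats: compute for `run_length` cycles, issue one remote
    read, suspend until the reply arrives `latency` cycles later.
    Returns (total_cycles, idle_cycles).
    """
    if reads_total % num_threads != 0:
        raise ValueError("reads_total must be divisible by num_threads")
    remaining = [reads_total // num_threads] * num_threads
    ready = deque(range(num_threads))  # FIFO ready queue of thread ids
    pending = []                       # min-heap of (reply_arrival_time, thread_id)
    clock = 0
    idle = 0
    while ready or pending:
        # Threads whose replies have arrived rejoin the FIFO queue.
        while pending and pending[0][0] <= clock:
            _, tid = heapq.heappop(pending)
            ready.append(tid)
        if not ready:
            # No runnable thread: the latency is exposed as idle time.
            idle += pending[0][0] - clock
            clock = pending[0][0]
            continue
        tid = ready.popleft()
        if remaining[tid] == 0:
            continue  # this thread has received its last reply and is done
        clock += switch_cost + run_length
        remaining[tid] -= 1
        heapq.heappush(pending, (clock + latency, tid))
    return clock, idle


def overlap_fraction(idle, reads_total, latency):
    """Fraction of total remote-read latency hidden behind computation."""
    return 1.0 - idle / (reads_total * latency)


if __name__ == "__main__":
    # Hypothetical parameters: 8 reads, 10-cycle run length, 40-cycle latency.
    for h in (1, 2, 4, 8):
        total, idle = simulate(h, 8, 10, 40)
        print(h, total, idle, overlap_fraction(idle, 8, 40))
```

With a single thread every read latency is exposed, so the overlap is zero. Adding threads lets computation of one thread fill the waiting time of another, which is the effect the experiments quantify on real hardware.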