Toward high-throughput algorithms on many-core architectures

  • Authors:
  • Daniel Orozco; Elkin Garcia; Rishi Khan; Kelly Livingston; Guang R. Gao

  • Affiliations:
  • University of Delaware; University of Delaware; ET International; University of Delaware; University of Delaware

  • Venue:
  • ACM Transactions on Architecture and Code Optimization (TACO) - HiPEAC Papers
  • Year:
  • 2012

Abstract

Advanced many-core CPU chips already have a few hundred processing cores (e.g., 160 cores in an IBM Cyclops-64 chip), and more cores become available as computer architecture progresses. The runtime systems underlying such architectures need to serve hundreds of processors efficiently at the same time, requiring all basic data structures within the runtime to sustain unprecedented throughput. In this paper, we analyze the throughput requirements that algorithms in runtime systems must meet to handle hundreds of simultaneous operations in real time. We reach a surprising conclusion: many traditional algorithmic techniques are poorly suited to highly parallel computing environments because of their low throughput. We further conclude that the intrinsic throughput of a parallel program depends on both its algorithm and the processor architecture on which the program runs. We provide a theory to quantify the intrinsic throughput of algorithms and illustrate it with examples that describe the intrinsic throughput of existing, common algorithms. We then explain how to follow a throughput-oriented approach to develop algorithms with very high intrinsic throughput on many-core architectures. We compare our throughput-oriented algorithms with other well-known algorithms that provide the same functionality, and we show that a throughput-oriented design produces algorithms with equal or faster performance in highly concurrent environments. We provide both theoretical and experimental evidence showing that our algorithms are excellent choices over other state-of-the-art algorithms. The major contributions of this paper are (1) motivating examples that show the importance of throughput in concurrent algorithms; (2) a mathematical framework that uses queueing theory to describe the intrinsic throughput of algorithms; (3) two highly concurrent algorithms with very high intrinsic throughput that are useful for task management in runtime systems; and (4) extensive experimental and theoretical results showing that, for highly parallel systems, our proposed algorithms provide scalability and performance greater than or at least equal to that of other well-known state-of-the-art algorithms.
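
As a generic illustration of the throughput argument above (a sketch using standard queueing theory, not the paper's specific framework): if every operation on a shared data structure must pass through a serialized section, such as a lock, with mean service time s, the structure behaves as a single-server queue and its sustainable throughput is bounded by

  \lambda_{\max} = 1 / s,

independent of the number of processors P issuing requests. By Little's law (L = \lambda W), once the server saturates with P processors waiting, the mean time per operation satisfies W \ge P \cdot s, so per-operation latency grows linearly with processor count while aggregate throughput stays flat. In this sense a low-throughput algorithm, even one with good sequential complexity, cannot scale on a chip with hundreds of cores.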
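
The paper's own task-management algorithms are not reproduced here; the following is only a hypothetical C11 sketch of what a throughput-oriented design can look like: slot indices in a bounded task pool are claimed with a single atomic fetch-and-add instead of a lock, shrinking the serialized work per operation (the service time s above) to one atomic instruction. The names task_pool_t and task_pool_push are illustrative, not from the paper.

  #include <stdatomic.h>
  #include <stddef.h>
  #include <stdio.h>

  #define POOL_CAPACITY 4096

  typedef struct {
      void (*fn)(void *);            /* task body     */
      void *arg;                     /* task argument */
  } task_t;

  typedef struct {
      task_t slots[POOL_CAPACITY];   /* preallocated task storage */
      atomic_size_t next_free;       /* next slot index to claim  */
  } task_pool_t;

  /* Many producers may call this concurrently: they serialize only on the
   * single fetch-and-add, not on a lock held across the whole insertion. */
  static int task_pool_push(task_pool_t *pool, task_t task)
  {
      size_t idx = atomic_fetch_add(&pool->next_free, 1);
      if (idx >= POOL_CAPACITY)
          return -1;                 /* pool full (simplified handling) */
      pool->slots[idx] = task;
      return 0;
  }

  static void hello(void *arg) { printf("task %ld\n", (long)arg); }

  int main(void)
  {
      static task_pool_t pool;       /* static storage: zero-initialized, next_free == 0 */
      for (long i = 0; i < 4; i++)
          task_pool_push(&pool, (task_t){ hello, (void *)i });

      /* Drain phase: runs only after all producers have finished
       * (e.g., past a barrier), so no per-slot synchronization is needed. */
      size_t n = atomic_load(&pool.next_free);
      for (size_t i = 0; i < n && i < POOL_CAPACITY; i++)
          pool.slots[i].fn(pool.slots[i].arg);
      return 0;
  }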