Toward high-throughput algorithms on many-core architectures

Authors:
Daniel Orozco;Elkin Garcia;Rishi Khan;Kelly Livingston;Guang R. Gao
Affiliations:
University of Delaware;University of Delaware;ET International;University of Delaware;University of Delaware
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Year:
2012

Citing 10
Cited 2

Cilk: an efficient multithreaded runtime system

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Simple, fast, and practical non-blocking and blocking concurrent queue algorithms

PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Memory access buffering in multiprocessors

25 years of the international symposia on Computer architecture (selected papers)
A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Earth: an efficient architecture for running threads

Earth: an efficient architecture for running threads
Using elimination to implement scalable and lock-free FIFO queues

Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Theory, Volume 1, Queueing Systems

Theory, Volume 1, Queueing Systems
Toward a Software Infrastructure for the Cyclops-64 Cellular Architecture

HPCS '06 Proceedings of the 20th International Symposium on High-Performance Computing in an Advanced Collaborative Environment
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Optimized dense matrix multiplication on a many-core architecture

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II

Dynamic percolation: a case of study on the shortcomings of traditional optimization in many-core architectures

Proceedings of the 9th conference on Computing Frontiers
An efficient unbounded lock-free queue for multi-core systems

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Advanced many-core CPU chips already have a few hundreds of processing cores (e.g., 160 cores in an IBM Cyclops-64 chip) and more and more processing cores become available as computer architecture progresses. The underlying runtime systems of such architectures need to efficiently serve hundreds of processors at the same time, requiring all basic data structures within the runtime to maintain unprecedented throughput. In this paper, we analyze the throughput requirements that must be met by algorithms in runtime systems to be able to handle hundreds of simultaneous operations in real time. We reach a surprising conclusion: Many traditional algorithm techniques are poorly suited for highly parallel computing environments because of their low throughput. We reach the conclusion that the intrinsic throughput of a parallel program depends on both its algorithm and the processor architecture where the program runs. We provide theory to quantify the intrinsic throughput of algorithms, and we provide a few examples, where we describe the intrinsic throughput of existing, common algorithms. Then, we go on to explain how to follow a throughput-oriented approach to develop algorithms that have very high intrinsic throughput in many core architectures. We compare our throughput-oriented algorithms with other well known algorithms that provide the same functionality and we show that a throughput-oriented design produces algorithms with equal or faster performance in highly concurrent environments. We provide both theoretical and experimental evidence showing that our algorithms are excellent choices over other state of the art algorithms. The major contributions of this paper are (1) motivating examples that show the importance of throughput in concurrent algorithms; (2) a mathematical framework that uses queueing theory to describe the intrinsic throughput of algorithms; (3) two highly concurrent algorithms with very high intrinsic throughput that are useful for task management in runtime systems; and (4) extensive experimental and theoretical results that show that for highly parallel systems, our proposed algorithms allow greater or at least equal scalability and performance than other well-known similar state-of-the-art algorithms.