An implementation of the codelet model

Authors:
Joshua Suettlerlein;Stéphane Zuckerman;Guang R. Gao
Affiliations:
University of Delaware, Newark, DE;University of Delaware, Newark, DE;University of Delaware, Newark, DE
Venue:
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Year:
2013

Abstract

Chip architectures are shifting from few, faster, functionally heavy cores to abundant, slower, simpler cores to address pressing physical limitations such as energy consumption and heat expenditure. As architectural trends continue to fluctuate, we propose a novel program execution model, the Codelet model, which is designed for new systems tasked with efficiently managing varying resources. The Codelet model is a fine-grained, dataflow-inspired model extended to address the cumbersome resources available in new architectures. In the following, we define the Codelet execution model and provide an implementation named DARTS. Using DARTS and two predominant kernels, matrix multiplication and the Graph 500's breadth-first search (BFS), we explore the validity of fine-grained execution as a promising and viable execution model for current and future architectures. We show that our runtime is on par with or better than AMD's highly optimized parallel library for matrix multiplication, outperforming it on average by 1.40× with a speedup of up to 4×. Our implementation of the parallel BFS outperforms Graph 500's reference implementation (with or without dynamic scheduling) on average by 1.50× with a speedup of up to 2.38×.
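
To make the "dataflow-inspired" idea concrete, the following is a minimal sketch of a dataflow firing rule: a unit of work becomes runnable only once all of its inputs have arrived. It is an illustration only and not the DARTS API. Every name here (`Codelet`, `pending`, `run_graph`, `demo_firing_order`) is hypothetical, and the scheduler is a single sequential queue rather than a parallel runtime.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical types for illustration; not taken from DARTS.
struct Codelet {
    int pending;                       // number of inputs still missing
    std::function<void()> body;        // work to run once all inputs arrived
    std::vector<Codelet*> successors;  // codelets to signal on completion
};

// Sequential stand-in for a scheduler: fire every codelet whose
// dependency count has reached zero, then signal its successors.
inline void run_graph(const std::vector<Codelet*>& all) {
    std::queue<Codelet*> ready;
    for (Codelet* c : all) {
        if (c->pending == 0) ready.push(c);
    }
    while (!ready.empty()) {
        Codelet* c = ready.front();
        ready.pop();
        c->body();
        for (Codelet* s : c->successors) {
            if (--s->pending == 0) ready.push(s);  // last input arrived
        }
    }
}

// Builds a three-node graph where C depends on both A and B,
// runs it, and returns the order in which the bodies fired.
inline std::string demo_firing_order() {
    std::string order;
    Codelet a{0, [&order] { order += 'A'; }, {}};
    Codelet b{0, [&order] { order += 'B'; }, {}};
    Codelet c{2, [&order] { order += 'C'; }, {}};
    a.successors.push_back(&c);
    b.successors.push_back(&c);
    run_graph({&c, &a, &b});  // C is listed first but cannot fire first
    return order;
}
```

Calling `demo_firing_order()` returns `"ABC"`: although `c` is handed to the scheduler first, it fires only after both `a` and `b` have signaled it. A real fine-grained runtime would replace the single queue with per-core schedulers and atomic counters; that part is outside this sketch.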