Tagged procedure calls (TPC): efficient runtime support for task-based parallelism on the Cell processor

Authors:
George Tzenakis;Konstantinos Kapelonis;Michail Alvanos;Konstantinos Koukos;Dimitrios S. Nikolopoulos;Angelos Bilas
Affiliations:
Institute of Computer Science (ICS), Foundation for Research and Technology - Hellas (FORTH), Heraklion, Greece (all authors)
Venue:
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Year:
2010

Abstract

Increasing the number of cores in modern CPUs is the main trend for improving system performance. A central challenge is the runtime support that multi-core systems need in order to sustain high performance and scalability without disproportionately increasing the effort required of the programmer. In this work we present Tagged Procedure Calls (TPC), a runtime system for supporting task-based programming models on architectures that require the programmer to specify data accesses explicitly. We present the design and implementation of TPC for the Cell processor and examine how the runtime system can support task management functions with on-chip communication only. By minimizing off-chip transactions in the runtime, we achieve sub-microsecond task initiation latency and a minimum null-task initiation/completion latency of 385 ns. We evaluate TPC with several kernels and applications, demonstrating that TPC achieves scalable on-chip execution of codes previously parallelized and optimized for shared-memory multiprocessors, can exploit additional fine-grain parallelism in codes previously parallelized at coarse granularity, and performs competitively with existing task-based parallel programming frameworks that statically optimize data layout and task placement.
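
As a rough illustration of what explicit data access specification can look like in a task-based model, the following C sketch tags each task argument with a size and an access direction, stages the inputs into a private buffer, runs the task on the staged copies, and writes the outputs back. It is not the TPC API: every name here (`task_arg`, `arg_dir`, `run_task`, `scale_demo`) is hypothetical, `memcpy` stands in for DMA transfers, a heap buffer stands in for an on-chip local store, and everything runs on a single thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical access tags (an assumption, not taken from TPC). */
typedef enum { ARG_IN, ARG_OUT, ARG_INOUT } arg_dir;

typedef struct {
    void   *addr;  /* location of the data in main memory */
    size_t  size;  /* number of bytes the task accesses   */
    arg_dir dir;   /* how the task accesses those bytes   */
} task_arg;

typedef void (*task_fn)(void **local_args);

#define MAX_ARGS          4
#define LOCAL_STORE_BYTES 1024
#define N                 4

/* Round up to a multiple of 16 so each staged argument stays aligned. */
static size_t pad16(size_t n) { return (n + 15) & ~(size_t)15; }

/* Emulates a worker: copy inputs in, run the task, copy outputs back.
 * Returns 0 on success, -1 if the arguments do not fit. */
int run_task(task_fn fn, const task_arg *args, int nargs)
{
    void *local[MAX_ARGS];
    size_t used = 0;

    assert(fn != NULL);
    if (nargs < 0 || nargs > MAX_ARGS) return -1;

    /* Stand-in for a local store; memcpy below stands in for DMA. */
    unsigned char *store = malloc(LOCAL_STORE_BYTES);
    if (store == NULL) return -1;

    for (int i = 0; i < nargs; i++) {
        if (args[i].size > LOCAL_STORE_BYTES - used) { free(store); return -1; }
        local[i] = store + used;
        if (args[i].dir != ARG_OUT)          /* IN and INOUT are staged in */
            memcpy(local[i], args[i].addr, args[i].size);
        used += pad16(args[i].size);
    }

    fn(local);                               /* task sees only staged copies */

    for (int i = 0; i < nargs; i++)
        if (args[i].dir != ARG_IN)           /* OUT and INOUT are written back */
            memcpy(args[i].addr, local[i], args[i].size);

    free(store);
    return 0;
}

/* Example task body: out[i] = 2 * in[i]. */
static void scale_by_two(void **a)
{
    const float *in = a[0];
    float *out = a[1];
    for (int i = 0; i < N; i++) out[i] = 2.0f * in[i];
}

/* Issues one task with an input and an output argument. */
int scale_demo(float out[N])
{
    float in[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    task_arg args[2] = {
        { in,  sizeof in,         ARG_IN  },
        { out, N * sizeof(float), ARG_OUT },
    };
    return run_task(scale_by_two, args, 2);
}
```

Because the task body touches only the staged copies, the tags alone tell the runtime which bytes to move and in which direction. That property is what lets a runtime on an explicitly managed memory hierarchy handle transfers on the programmer's behalf.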