HARS: A hardware-assisted runtime software for embedded many-core architectures

Authors:
Yves Lhuillier;Maroun Ojail;Alexandre Guerre;Jean-Marc Philippe;Karim Ben Chehida;Farhat Thabet;Caaliph Andriamisaina;Chafic Jaber;Raphaël David
Affiliations:
CEA, LIST;CEA, LIST;CEA, LIST;CEA, LIST;CEA, LIST;CEA, LIST;CEA, LIST;CEA, LIST;CEA, LIST
Venue:
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Year:
2014

Abstract

The current trend in embedded computing is to increase the number of processing resources on a chip. Following this trend, cluster-based many-core accelerators with a shared hierarchical memory have emerged. Handling synchronization efficiently on these architectures is critical, since the speed-up of parallel implementations of embedded applications strongly depends on the ability to exploit the largest possible number of cores while limiting task management overhead. This article presents HARS (Hardware-Assisted Runtime Software), the combination of a complete, low-overhead runtime software and a flexible hardware accelerator for synchronization. Experiments on a multicore test chip showed that the hardware accelerator adds less than 1% area overhead relative to a cluster of the chip while reducing synchronization latencies (by up to 2.8 times compared with a test-and-set implementation) and contention. The runtime software offers basic features such as memory management, as well as optimized execution engines that allow easy and efficient extraction of parallelism in applications under multiple programming models. By combining the hardware acceleration with a very low-overhead task scheduling software technique, we show that HARS outperforms an optimized state-of-the-art task scheduler by 13% for the execution of a parallel application.
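
For readers unfamiliar with the software baseline named in the abstract, a test-and-set lock can be pictured roughly as follows. This is a generic C11 sketch, not the implementation evaluated in the article; the names `tas_lock_t`, `tas_try_lock`, `tas_lock`, and `tas_unlock` are hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical type name: the lock is a single atomic flag in shared memory. */
typedef atomic_flag tas_lock_t;

/* One test-and-set attempt. Returns true if the caller acquired the lock. */
static bool tas_try_lock(tas_lock_t *l)
{
    /* test_and_set returns the previous value: false means the lock was free. */
    return !atomic_flag_test_and_set_explicit(l, memory_order_acquire);
}

/* Spin until the lock is acquired. */
static void tas_lock(tas_lock_t *l)
{
    while (!tas_try_lock(l)) {
        /* Busy-wait: each retry is an atomic read-modify-write
         * on the same shared memory location. */
    }
}

/* Release the lock so that another core's test-and-set can succeed. */
static void tas_unlock(tas_lock_t *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}
```

Every waiting core in such a loop repeatedly issues atomic operations on one shared location, which is the kind of latency and contention that a dedicated synchronization accelerator aims to reduce.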