Over the past few years, embedded digital systems have been required to deliver ever more processing power and functionality. A likely next step in sustaining this trend toward more computation and flexibility is the widespread adoption of on-chip multiprocessor platforms (MPSoCs). In this context, choosing a programming model and providing optimized hardware support for it on these platforms is a challenging task. Work-stealing (WS) based parallelization is a current research trend for handling, in a portable way, MPSoCs that differ in their number of processors and whose processors may run at different frequencies. This paper evaluates the impact of some simple MPSoC architectural characteristics on the performance of WS. Previous evaluations of WS, whether theoretical or experimental, were carried out on fixed multicore architectures. This work extends those studies by exploring the use of WS for the codesign of embedded applications on MPSoC platforms with different hardware capabilities, using cycle-accurate measurements. We first study the architectural choices suited to WS algorithms and measure the benefit of these architectural modifications. To assess whether WS is suited to the MPSoC context, we experimentally measure its intrinsic implementation overhead on the most efficient architectural designs. Finally, we validate the performance of the approach on two real applications: a regular multimedia application (temporal noise reduction) and an irregular, computation-intensive application (frames of the Mandelbrot set). Our results show that, for the applications considered, enhancing MPSoC platforms of up to 16 processors with widespread hardware support mechanisms can yield significant performance improvements at an acceptable hardware cost.