A practical approach for reconciling high and predictable performance in non-regular parallel programs

Authors:
Olivier Certner;Zheng Li;Pierre Palatin;Olivier Temam;Frederic Arzel;Nathalie Drach
Affiliations:
Alchemy Project, INRIA Saclay, France;Alchemy Project, INRIA Saclay, France;Alchemy Project, INRIA Saclay, France;Alchemy Project, INRIA Saclay, France;Pierre et Marie Curie University, Paris, France;Pierre et Marie Curie University, Paris, France
Venue:
Proceedings of the conference on Design, automation and test in Europe
Year:
2008

Abstract

Increasingly complex consumer electronics applications call for embedded processors with higher performance. Multi-cores can deliver the required performance. However, many of these embedded applications must meet some form of soft real-time constraint, and program behavior is even harder to predict on multi-cores than on single-cores. In this article, we highlight the greater performance variability, across data sets, of irregular applications (those with non-regular control flow and/or data structures) when they are parallelized and run on a multi-core. We then show that a proper parallelization approach coupled with a lightweight run-time system can drastically reduce this performance variability without sacrificing performance. This approach requires no complex program or architecture analysis or modeling. Moreover, we show that parallel program performance becomes stable enough that it can be predicted with reasonable accuracy by sampling a few training runs.