Scalable hardware support for conditional parallelization

Authors:
Zheng Li; Olivier Certner; Jose Duato; Olivier Temam
Affiliations:
INRIA Saclay, Orsay, France; ST Microelectronics & INRIA Saclay, Orsay, France; Polytechnic University of Valencia, Valencia, Spain; INRIA Saclay, Orsay, France
Venue:
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Year:
2010

Abstract

Parallel programming approaches based on task division/spawning are becoming increasingly popular because they provide a simple and elegant abstraction of parallelization while achieving good performance on workloads that are traditionally hard to parallelize because of their complex control flow and data structures. The ability to quickly distribute fine-grained tasks among many cores is key to the efficiency and scalability of such division-based approaches. For this reason, several hardware support mechanisms for work-stealing environments have already been proposed. However, they all rely on a central hardware structure for distributing tasks among cores, which hampers their scalability and efficiency. In this paper, we focus on conditional division, a division-based parallel approach that offers two additional benefits over work-stealing approaches: it relieves the user of dealing with task granularity, and it does not clog hardware resources with an exceedingly large number of small tasks. For this type of division-based approach, we show that it is possible to design hardware support for speeding up task division that relies entirely on local information and thus exhibits good scalability properties.
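
To make the idea of conditional division concrete, the following is a minimal software sketch, not the hardware mechanism proposed in the paper. It assumes a hypothetical `Core` class whose `probe()` method grants a division only when an execution context is free, and a hypothetical `conditional_sum` function that divides its work only when the probe succeeds. All names are illustrative, and the single shared counter only models the programming pattern; it says nothing about how the paper distributes this decision using local information.

```python
import threading


class Core:
    """Hypothetical stand-in for hardware that tracks free execution contexts."""

    def __init__(self, free_contexts):
        self._free = free_contexts
        self._lock = threading.Lock()
        self.divisions = 0  # number of divisions actually granted

    def probe(self):
        """Grant a context if one is free; otherwise deny the division."""
        with self._lock:
            if self._free > 0:
                self._free -= 1
                self.divisions += 1
                return True
            return False

    def release(self):
        """Return a context once the divided task has finished."""
        with self._lock:
            self._free += 1


def conditional_sum(data, lo, hi, core):
    """Sum data[lo:hi], dividing only when probe() grants a context."""
    if hi - lo <= 1:
        return sum(data[lo:hi])
    mid = (lo + hi) // 2
    if core.probe():
        # Division granted: the left half runs in a new thread.
        result = {}

        def child():
            result["left"] = conditional_sum(data, lo, mid, core)

        worker = threading.Thread(target=child)
        worker.start()
        right = conditional_sum(data, mid, hi, core)
        worker.join()
        core.release()
        return result["left"] + right
    # Division denied: continue sequentially, so no small task is created.
    return conditional_sum(data, lo, mid, core) + conditional_sum(data, mid, hi, core)
```

In this sketch the caller never chooses a task granularity: the same code runs fully sequentially when `Core(0)` is used and divides opportunistically when contexts are available, which is the property the abstract contrasts with work stealing.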