Exploiting multilevel parallelism using OpenMP on a massive multithreaded architecture

Authors:
David Ródenas;Xavier Martorell;Eduard Ayguadé;Jesús Labarta;George Almási;Călin Caşcaval;José Castaños;José Moreira
Affiliations:
Barcelona Supercomputing Center, UPC, Campus Nord - C6, Jordi Girona 1-3, 08034 Barcelona, Spain;(Correspd. Tel.: +34 93 405 40 42/ Fax: +34 93 401 70 55/ E-mail: xavim@ac.upc.edu) Barcelona Supercomputing Center, UPC, Campus Nord - C6, Jordi Girona 1-3, 08034 Barcelona, Spain;Barcelona Supercomputing Center, UPC, Campus Nord - C6, Jordi Girona 1-3, 08034 Barcelona, Spain;Barcelona Supercomputing Center, UPC, Campus Nord - C6, Jordi Girona 1-3, 08034 Barcelona, Spain;IBM T.J. Watson Research Center, 1101 Kitchawan Road, Route 134, Yorktown Heights, NY 10598, USA;IBM T.J. Watson Research Center, 1101 Kitchawan Road, Route 134, Yorktown Heights, NY 10598, USA;IBM T.J. Watson Research Center, 1101 Kitchawan Road, Route 134, Yorktown Heights, NY 10598, USA;IBM T.J. Watson Research Center, 1101 Kitchawan Road, Route 134, Yorktown Heights, NY 10598, USA
Venue:
Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
Year:
2006

Abstract

This paper evaluates and analyzes multilevel parallelism on a chip multiprocessor (CMP) architecture. The environment is based on the experimental IBM BG/Cyclops architecture, on which we ran the multi-zone parallel benchmarks. Multilevel parallelism is spawned using the Nanos OpenMP execution environment. We varied the execution parameters to evaluate different hardware thread distributions, cache utilization, and thread grouping configurations. Our results demonstrate that a large number of thread groups and good balancing algorithms are critical for high performance. We also show that a small number of threads can share the same data cache to increase performance, but that a large number of threads should not share the same data cache.
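To make the idea of multilevel parallelism concrete, the following is a minimal sketch in C using standard OpenMP nested parallel regions. It is not the Nanos mechanism and does not reproduce the paper's benchmarks: the `zone_t` type, the `relax_zones` function, the per-cell update, and the `num_groups` and `threads_per_group` parameters are all hypothetical names chosen for illustration. The outer loop distributes zones over thread groups, and the inner loop splits the cells of one zone among the threads of a group.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical zone descriptor: each zone owns a contiguous array of cells. */
typedef struct {
    double *cells;
    size_t  num_cells;
} zone_t;

/* Outer level: zones are handed out to num_groups threads (one per group).
 * Inner level: each group's threads_per_group threads share one zone's cells.
 * Both counts are illustrative parameters, not values taken from the paper. */
void relax_zones(zone_t *zones, int num_zones,
                 int num_groups, int threads_per_group)
{
    assert(zones != NULL && num_zones > 0);
    assert(num_groups > 0 && threads_per_group > 0);

#ifdef _OPENMP
    /* Allow two active levels of nested parallelism. */
    omp_set_max_active_levels(2);
#endif

    /* Dynamic scheduling is a simple stand-in for a balancing policy
     * when zones have unequal sizes. */
    #pragma omp parallel for num_threads(num_groups) schedule(dynamic)
    for (int z = 0; z < num_zones; z++) {
        double *c = zones[z].cells;
        long n = (long)zones[z].num_cells;

        /* Placeholder per-cell update; a real solver step would go here. */
        #pragma omp parallel for num_threads(threads_per_group) schedule(static)
        for (long i = 0; i < n; i++) {
            c[i] = 0.5 * c[i] + 1.0;
        }
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result, which makes it easy to compare against the threaded build. With OpenMP enabled, varying `num_groups` and `threads_per_group` is one way to explore the kind of grouping trade-off the abstract describes.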