Hardware support for OpenMP collective operations

Authors:
Soohong P. Kim; Samuel P. Midkiff; Henry G. Dietz
Affiliations:
School of ECE, Purdue University; School of ECE, Purdue University; Department of ECE, University of Kentucky
Venue:
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Year:
2009

Abstract

Efficient implementation of OpenMP collective operations (e.g., barriers and reductions) is essential for good performance from OpenMP programs. State-of-the-art on-chip networks and block-based cache coherence protocols used in shared-memory Chip MultiProcessors (CMPs) are inefficient for implementing these collective operations. The performance of CMPs can be seriously degraded by the multitude of memory requests and coherence messages required to implement collective operations. To provide efficient support for OpenMP collective operations, this paper presents a CMP-AFN architecture and Instruction Set Architecture (ISA) extensions that augment a conventional shared-memory CMP with a tightly integrated Aggregate Function Network (AFN), which implements low-latency collectives without using or interfering with the memory hierarchy. For a modest increase in circuit complexity, traffic within a CMP's internal network is dramatically reduced, improving the performance of caches and reducing power consumption. Full-system simulations of 16-core CMPs show that a CMP-AFN significantly outperforms the reference design, eliminating more than 60% of memory accesses and more than 70% of private L1 data cache misses in both the EPCC OpenMP microbenchmarks and the SPEC OMP benchmarks.
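
For readers unfamiliar with the collectives named in the abstract, the following is a minimal sketch of how a reduction and a barrier appear in OpenMP source code. It is an illustration only and is not taken from the paper: the function names `sum_of_squares` and `double_then_rotate` are hypothetical, and the sketch says nothing about the CMP-AFN ISA extensions themselves. On a conventional shared-memory CMP, the runtime would typically implement both constructs through shared variables, which is the source of the memory and coherence traffic the abstract describes.

```c
#include <assert.h>

/* Hypothetical example: 0^2 + 1^2 + ... + (n-1)^2 using an OpenMP reduction.
 * If the compiler is run without OpenMP support, the pragma is ignored and
 * the loop runs serially with the same result. */
long sum_of_squares(long n)
{
    assert(n >= 0);
    long total = 0;

    /* reduction(+:total): each thread accumulates a private partial sum,
     * and the runtime combines the partial sums when the loop finishes. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++) {
        total += i * i;
    }
    return total;
}

/* Hypothetical example: a two-phase computation separated by a barrier.
 * Phase 1 writes tmp[i] = 2 * in[i]; phase 2 reads a neighbouring element
 * of tmp, so no thread may start phase 2 until every thread has finished
 * phase 1. */
void double_then_rotate(const int *in, int *out, int *tmp, int n)
{
    #pragma omp parallel
    {
        /* nowait drops the implicit barrier at the end of this loop ... */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            tmp[i] = in[i] * 2;
        }

        /* ... so the explicit barrier below is what orders the two phases. */
        #pragma omp barrier

        #pragma omp for
        for (int i = 0; i < n; i++) {
            out[i] = tmp[(i + 1) % n];
        }
    }
}
```

Both functions give the same results whether or not OpenMP is enabled at compile time (for example, with or without `-fopenmp` in GCC or Clang), so the sketch can be checked serially first.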