Is data distribution necessary in OpenMP?

Authors:
Dimitrios S. Nikolopoulos;Theodore S. Papatheodorou;Constantine D. Polychronopoulos;Jesus Labarta;Eduard Ayguadé
Affiliations:
Department of Computer Engineering and Informatics, University of Patras, Greece;Department of Computer Engineering and Informatics, University of Patras, Greece;Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign;Department of Computer Architecture, Technical University of Catalonia, Spain;Department of Computer Architecture, Technical University of Catalonia, Spain
Venue:
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Year:
2000

Abstract

This paper investigates the performance implications of data placement in OpenMP programs running on modern ccNUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of state-of-the-art ccNUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution of pages, incur modest performance losses. We also show that performance leaks stemming from suboptimal page placement schemes can be remedied with a smart user-level page migration engine. The main body of the paper describes how the OpenMP runtime environment can use page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results support the effectiveness of these mechanisms and provide a proof of concept that there is no need to introduce data distribution directives in OpenMP and warrant the portability of the programming model.
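
As a rough illustration of the placement and migration ideas named in the abstract, the Python sketch below models round-robin and random page placement across NUMA nodes and a simple migration pass. It is a toy model under stated assumptions, not the paper's migration engine: every function name is hypothetical, each node is assumed to touch one contiguous block of pages (as a statically scheduled parallel loop might), and the model counts remote accesses rather than measuring latency.

```python
import random
from collections import Counter


def round_robin_placement(num_pages, num_nodes):
    """Hypothetical helper: page p is placed on node p mod num_nodes."""
    return [p % num_nodes for p in range(num_pages)]


def random_placement(num_pages, num_nodes, seed=0):
    """Hypothetical helper: each page is placed on a pseudo-randomly chosen node."""
    rng = random.Random(seed)
    return [rng.randrange(num_nodes) for _ in range(num_pages)]


def block_accesses(num_pages, num_nodes):
    """Assumed access pattern: each node touches one contiguous block of pages.

    Returns a list of (node, page) pairs. Assumes num_pages >= num_nodes.
    """
    per_node = num_pages // num_nodes
    return [(min(p // per_node, num_nodes - 1), p) for p in range(num_pages)]


def remote_fraction(placement, accesses):
    """Fraction of accesses that target a page placed on another node."""
    remote = sum(1 for node, page in accesses if placement[page] != node)
    return remote / len(accesses)


def migrate(placement, accesses):
    """Toy migration pass: move each page to the node that accessed it most."""
    counts = {}
    for node, page in accesses:
        counts.setdefault(page, Counter())[node] += 1
    new_placement = list(placement)
    for page, per_node_counts in counts.items():
        new_placement[page] = per_node_counts.most_common(1)[0][0]
    return new_placement


if __name__ == "__main__":
    accesses = block_accesses(16, 4)
    for name, placement in [
        ("round-robin", round_robin_placement(16, 4)),
        ("random", random_placement(16, 4)),
    ]:
        before = remote_fraction(placement, accesses)
        after = remote_fraction(migrate(placement, accesses), accesses)
        print(name, "remote fraction before:", before, "after:", after)
```

In this model, with 16 pages and 4 nodes, round-robin placement leaves 12 of the 16 block accesses remote, and the migration pass brings every page to the node that uses it. Whether such a remote fraction matters in practice depends on the remote-to-local latency ratio, which the sketch does not model.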