The trade-off between implicit and explicit data distribution in shared-memory programming paradigms

Authors:
Dimitrios S. Nikolopoulos;Eduard Ayguadé;Theodore S. Papatheodorou;Constantine D. Polychronopoulos;Jesús Labarta
Affiliations:
Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main Street, Urbana, IL;Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, c/Jordi Girona 1-3, 08034 Barcelona, Spain;Department of Computer Engineering and Informatics, University of Patras, Rion, 26500 Patras, Greece;Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main Street, Urbana, IL;Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, c/Jordi Girona 1-3, 08034 Barcelona, Spain
Venue:
ICS '01 Proceedings of the 15th international conference on Supercomputing
Year:
2001

Abstract

This paper explores previously established and novel methods for scaling the performance of OpenMP on NUMA architectures. The spectrum of methods under investigation includes OS-level automatic page placement algorithms, dynamic page migration, and manual data distribution. The trade-off that these methods face lies between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration is also transparent, but requires careful engineering of online algorithms to be effective. Manual data distribution, on the other hand, requires substantial programming effort and architecture-specific extensions to OpenMP, but may localize memory accesses in a nearly optimal manner.

The main contributions of the paper are: a classification of application characteristics, which clearly identifies the conditions under which transparent methods are both capable and sufficient for optimizing memory locality in an OpenMP program; and the use of two novel runtime techniques, runtime data distribution based on memory access traces and affinity scheduling with iteration schedule reuse, as competitive substitutes for manual data distribution in several important classes of applications.