Productive cluster programming with OmpSs

Authors:
Javier Bueno;Luis Martinell;Alejandro Duran;Montse Farreras;Xavier Martorell;Rosa M. Badia;Eduard Ayguade;Jesús Labarta
Affiliations:
Barcelona Supercomputing Center and Universitat Politècnica de Catalunya;Barcelona Supercomputing Center;Barcelona Supercomputing Center;Barcelona Supercomputing Center and Universitat Politècnica de Catalunya;Barcelona Supercomputing Center and Universitat Politècnica de Catalunya;Barcelona Supercomputing Center and Artificial Intelligence Research Institute, Spanish National Research Council;Barcelona Supercomputing Center and Universitat Politècnica de Catalunya;Barcelona Supercomputing Center and Universitat Politècnica de Catalunya
Venue:
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Year:
2011

Abstract

Clusters of SMPs are ubiquitous and have traditionally been programmed with MPI. However, the productivity of MPI programmers is low because of the complexity of expressing parallelism and communication, and the difficulty of debugging. To ease the burden on the programmer, newer programming models give the illusion of a global shared address space (e.g., UPC, Co-array Fortran). Unfortunately, these models do not support the increasingly common irregular forms of parallelism that require asynchronous task parallelism. Other models, such as X10 or Chapel, provide this asynchronous parallelism, but they require the programmer to rewrite the application entirely. We present the implementation of OmpSs for clusters, a variant of OpenMP extended to support asynchrony, heterogeneity and data movement for task parallelism. Like OpenMP, it is based on decorating an existing serial version with compiler directives, which are translated into calls to a runtime system that manages parallelism extraction, data coherence and data movement. Thus, the same OmpSs program can run on a regular SMP machine or on clusters of SMPs, and can even be debugged as the serial version. The runtime uses the information provided by the programmer to distribute the work across the cluster while optimizing communication through affinity scheduling and data caching. We have evaluated our proposal with a set of kernels, and the OmpSs versions obtain performance comparable, or even superior, to that of the MPI versions of the same kernels.
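As an illustration of the directive-based approach described in the abstract, the following is a minimal sketch of a serial C kernel decorated with OmpSs-style task directives. It is not taken from the paper. The clause spellings (`in`, `out`, `inout`) and the `[start;length]` array-section notation are assumptions based on publicly documented OmpSs syntax and may differ between releases. The function names and the blocking scheme are hypothetical. A compiler without OmpSs support (and without OpenMP enabled) ignores the directives, so the code runs as the plain serial version.

```c
#include <stddef.h>

/* Plain serial kernel: c[i] = a[i] + b[i] for one block of bs elements. */
static void add_block(const double *a, const double *b, double *c, size_t bs)
{
    for (size_t i = 0; i < bs; i++)
        c[i] = a[i] + b[i];
}

/* Plain serial kernel: accumulate one block of c into *sum. */
static void sum_block(const double *c, size_t bs, double *sum)
{
    for (size_t i = 0; i < bs; i++)
        *sum += c[i];
}

/* Hypothetical driver (name and signature are illustrative only).
 * Assumes n is a multiple of bs. */
double blocked_add_and_sum(const double *a, const double *b, double *c,
                           size_t n, size_t bs)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i += bs) {
        /* Assumed OmpSs-style clauses: this task reads one block of a and b
         * and writes the matching block of c. */
        #pragma omp task in(a[i;bs], b[i;bs]) out(c[i;bs])
        add_block(&a[i], &b[i], &c[i], bs);

        /* This task reads the block of c produced above, so a dependence-aware
         * runtime would order it after that task. The inout(sum) clause
         * serializes the accumulation across iterations. */
        #pragma omp task in(c[i;bs]) inout(sum)
        sum_block(&c[i], bs, &sum);
    }

    /* Wait for all outstanding tasks before reading sum. */
    #pragma omp taskwait
    return sum;
}
```

The declared inputs and outputs are the kind of programmer-provided information the abstract refers to: under the stated assumptions, a runtime could use them both to order tasks and to decide which data must be moved or cached on each node. The `add_block` tasks of different iterations touch disjoint blocks and could therefore run concurrently.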