Optimization techniques for efficient HTA programs

Authors:
Basilio B. Fraguela;Ganesh Bikshandi;Jia Guo;María J. Garzarán;David Padua;Christoph von Praun
Affiliations:
Depto. de Electrónica e Sistemas, Universidade da Coruña, Facultade de Informática, Campus de Elviña, S/N, 15071 A Coruña, Spain;Intel Labs, Intel Technology India Pvt. Ltd., Bangalore 560 103, Karnataka, India;Dept. of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, 61801 IL, USA;Dept. of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, 61801 IL, USA;Dept. of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, 61801 IL, USA;Fakultät Informatik, Georg-Simon-Ohm Hochschule, Postfach 210320, 90121 Nuremberg, Germany
Venue:
Parallel Computing
Year:
2012



Abstract

Object-oriented languages can be easily extended with new data types, which facilitates prototyping new language extensions. A very challenging problem is the development of data types that encapsulate data-parallel operations, which could improve parallel programming productivity. However, the use of class libraries to implement data types, particularly when they encapsulate parallelism, comes at the expense of performance overhead. This paper describes our experience with the implementation of a C++ data type called the hierarchically tiled array (HTA). This object includes data-parallel operations and allows the manipulation of tiles, which facilitates developing efficient parallel codes and codes with a high degree of locality. The initial performance of the HTA programs we wrote was lower than that of their conventional MPI-based counterparts. The overhead was due to factors such as the creation of temporary HTAs and the inability of the compiler to properly inline index computations, among others. We describe the performance problems and the optimizations applied to overcome them, as well as their impact on programmability. After the optimization process, our HTA-based implementations run only slightly slower than the MPI-based codes while having much better programmability metrics.
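
The temporary-creation overhead mentioned in the abstract can be illustrated with a minimal sketch. The code below is not the HTA library and does not use its API: `TiledArray`, `operator_plus_allocations`, and `add3_into` are hypothetical names invented here, and the array is a simple single-level, single-process stand-in. It only shows why an overloaded `operator+` allocates one intermediate array per call, and how a fused, tile-by-tile loop that writes into an existing array avoids those allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tiled array; this is NOT the HTA library API.
struct TiledArray {
    std::size_t num_tiles;
    std::size_t tile_size;
    std::vector<double> data;  // tiles stored back to back

    TiledArray(std::size_t tiles, std::size_t size, double init)
        : num_tiles(tiles), tile_size(size), data(tiles * size, init) {}
};

// Counts the arrays allocated inside operator+ (for illustration only).
static int operator_plus_allocations = 0;

// Element-wise sum that returns a freshly allocated array on every call.
TiledArray operator+(const TiledArray& x, const TiledArray& y) {
    assert(x.data.size() == y.data.size());
    ++operator_plus_allocations;
    TiledArray result(x.num_tiles, x.tile_size, 0.0);
    for (std::size_t i = 0; i < x.data.size(); ++i) {
        result.data[i] = x.data[i] + y.data[i];
    }
    return result;
}

// Fused alternative: a single pass over each tile that writes into an
// existing array, so no intermediate array is allocated.
void add3_into(TiledArray& out, const TiledArray& x,
               const TiledArray& y, const TiledArray& z) {
    assert(out.data.size() == x.data.size());
    assert(out.data.size() == y.data.size());
    assert(out.data.size() == z.data.size());
    for (std::size_t t = 0; t < out.num_tiles; ++t) {
        const std::size_t base = t * out.tile_size;  // start of tile t
        for (std::size_t j = 0; j < out.tile_size; ++j) {
            out.data[base + j] =
                x.data[base + j] + y.data[base + j] + z.data[base + j];
        }
    }
}
```

In this sketch, evaluating `b + c + d` calls `operator+` twice and therefore allocates two arrays, whereas `add3_into(e, b, c, d)` computes the same values without allocating any. The paper's actual optimizations may differ from this; the sketch only illustrates the general source of the overhead.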