Optimization techniques for efficient HTA programs

  • Authors:
  • Basilio B. Fraguela; Ganesh Bikshandi; Jia Guo; María J. Garzarán; David Padua; Christoph von Praun

  • Affiliations:
  • Depto. de Electrónica e Sistemas, Universidade da Coruña, Facultade de Informática, Campus de Elviña, S/N, 15071 A Coruña, Spain; Intel Labs, Intel Technology India Pvt. Ltd., Bangalore 560 103, Karnataka, India; Dept. of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, 61801 IL, USA; Dept. of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, 61801 IL, USA; Dept. of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, 61801 IL, USA; Fakultät Informatik, Georg-Simon-Ohm Hochschule, Postfach 210320, 90121 Nuremberg, Germany

  • Venue:
  • Parallel Computing
  • Year:
  • 2012

Abstract

Object-oriented languages can be easily extended with new data types, which facilitates prototyping new language extensions. A very challenging problem is the development of data types encapsulating data-parallel operations, which could improve parallel programming productivity. However, the use of class libraries to implement data types, particularly when they encapsulate parallelism, incurs performance overhead. This paper describes our experience with the implementation of a C++ data type called the hierarchically tiled array (HTA). This object includes data-parallel operations and allows the manipulation of tiles, which facilitates developing efficient parallel codes and codes with a high degree of locality. The initial performance of the HTA programs we wrote was lower than that of their conventional MPI-based counterparts. The overhead was due to factors such as the creation of temporary HTAs and the inability of the compiler to properly inline index computations. We describe the performance problems and the optimizations applied to overcome them, as well as their impact on programmability. After the optimization process, our HTA-based implementations run only slightly slower than the MPI-based codes while having much better programmability metrics.
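
The temporary-object overhead mentioned in the abstract can be illustrated with a minimal C++ sketch. Everything in it is an assumption made for illustration: `Tile`, `tile_allocations`, `add3_into`, `naive_sum` and `fused_sum` are hypothetical names and do not reproduce the HTA library's API or the authors' optimizations. It only shows how an overloaded operator tends to allocate an intermediate buffer per subexpression, and how writing into an existing destination avoids that.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Global counter of buffers created through Tile's sizing constructor.
inline std::size_t& tile_allocations() {
    static std::size_t count = 0;
    return count;
}

// Hypothetical stand-in for one tile of an array (not the HTA class).
struct Tile {
    std::vector<double> data;
    explicit Tile(std::size_t n, double v = 0.0) : data(n, v) {
        ++tile_allocations();
    }
};

// Naive overloaded operator: each call materializes a new temporary Tile.
inline Tile operator+(const Tile& a, const Tile& b) {
    Tile r(a.data.size());
    for (std::size_t i = 0; i < r.data.size(); ++i)
        r.data[i] = a.data[i] + b.data[i];
    return r;
}

// Fused alternative: computes a + b + c directly into dst, no temporaries.
inline void add3_into(Tile& dst, const Tile& a, const Tile& b, const Tile& c) {
    for (std::size_t i = 0; i < dst.data.size(); ++i)
        dst.data[i] = a.data[i] + b.data[i] + c.data[i];
}

// Evaluates d = a + b + c with operators; returns how many Tiles were created.
inline std::size_t naive_sum(Tile& d, const Tile& a, const Tile& b, const Tile& c) {
    const std::size_t before = tile_allocations();
    d = a + b + c;  // one temporary for (a + b), another for (... + c)
    return tile_allocations() - before;
}

// Evaluates the same sum in place; returns how many Tiles were created.
inline std::size_t fused_sum(Tile& d, const Tile& a, const Tile& b, const Tile& c) {
    const std::size_t before = tile_allocations();
    add3_into(d, a, b, c);
    return tile_allocations() - before;
}
```

Both functions produce the same values in `d`; they differ only in how many intermediate buffers are created along the way, which is the kind of cost that grows when tiles are large or the expression sits inside a hot loop.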