On the Scalability of an Automatically Parallelized Irregular Application

  • Authors:
  • Martin Burtscher;Milind Kulkarni;Dimitrios Prountzos;Keshav Pingali

  • Affiliations:
  • Center for Grid and Distributed Computing, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712 (all authors)

  • Venue:
  • Languages and Compilers for Parallel Computing
  • Year:
  • 2008
  • Cited by:
  • Delegated isolation (Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications)
  • Isolation for nested task parallelism (Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications)

Abstract

Irregular applications, i.e., programs that manipulate pointer-based data structures such as graphs and trees, constitute a challenging target for parallelization because the amount of parallelism is input dependent and changes dynamically. Traditional dependence analysis techniques are too conservative to expose this parallelism, and even manual parallelization is difficult, time consuming, and error prone. The Galois system parallelizes such applications using an optimistic approach that exploits the higher-level semantics of abstract data types. In this paper, we study the performance and scalability of a "Galoised", i.e., automatically parallelized, version of Delaunay mesh refinement (DR) on a shared-memory system with 128 CPUs. DR is an important irregular application used, e.g., in graphics and finite-element codes. The parallelized program scales to 64 threads, where it reaches a speedup of 25.8. At larger thread counts, performance is hampered by load imbalance and nonuniform memory latency, both of which grow with the number of threads. While these two issues remain to be addressed in future work, we believe our results already show the Galois approach to be very promising.
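To make the abstract's optimistic approach concrete, the sketch below illustrates the general pattern of optimistic parallel worklist execution: each thread speculatively claims a work item (in DR, a bad triangle), atomically locks the set of elements the operation would touch (the triangle's cavity), and rolls the item back onto the worklist on conflict. This is a minimal illustration of the technique, not the Galois API; all names (`OptimisticWorklist`, `neighborhood`, etc.) are hypothetical.

```python
import threading
from collections import deque

class OptimisticWorklist:
    """Toy optimistic worklist: all-or-nothing neighborhood locking with rollback."""

    def __init__(self, items, neighborhood):
        self.work = deque(items)
        self.work_lock = threading.Lock()
        self.neighborhood = neighborhood      # item -> set of elements it touches
        self.element_locks = {}               # element -> owning thread id
        self.state_lock = threading.Lock()
        self.processed = []                   # items successfully committed

    def try_acquire(self, item, tid):
        # Atomically lock every element in the item's neighborhood ("cavity").
        # If any element is owned by another thread, acquire nothing (conflict).
        with self.state_lock:
            elems = self.neighborhood[item]
            if any(self.element_locks.get(e) not in (None, tid) for e in elems):
                return False
            for e in elems:
                self.element_locks[e] = tid
            return True

    def release(self, item):
        with self.state_lock:
            for e in self.neighborhood[item]:
                self.element_locks.pop(e, None)

    def run(self, num_threads):
        def worker(tid):
            while True:
                with self.work_lock:
                    if not self.work:
                        return
                    item = self.work.popleft()
                if self.try_acquire(item, tid):
                    with self.state_lock:
                        self.processed.append(item)   # commit: "refine" the cavity
                    self.release(item)
                else:
                    with self.work_lock:              # conflict: roll back, retry later
                        self.work.append(item)
        threads = [threading.Thread(target=worker, args=(t,))
                   for t in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

Because acquisition is all-or-nothing under a single lock, the sketch cannot deadlock; conflicting items simply retry once the owning thread commits and releases, which mirrors the abort-and-retry behavior the abstract attributes to optimistic parallelization.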