On the Scalability of an Automatically Parallelized Irregular Application

  • Authors:
  • Martin Burtscher;Milind Kulkarni;Dimitrios Prountzos;Keshav Pingali

  • Affiliations:
  • Center for Grid and Distributed Computing, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712 (all authors)

  • Venue:
  • Languages and Compilers for Parallel Computing
  • Year:
  • 2008
  • Cited by:
  • Delegated isolation (Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications)
  • Isolation for nested task parallelism (Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications)

Abstract

Irregular applications, i.e., programs that manipulate pointer-based data structures such as graphs and trees, constitute a challenging target for parallelization because the amount of parallelism is input dependent and changes dynamically. Traditional dependence analysis techniques are too conservative to expose this parallelism, and even manual parallelization is difficult, time consuming, and error prone. The Galois system parallelizes such applications using an optimistic approach that exploits the higher-level semantics of abstract data types. In this paper, we study the performance and scalability of a "Galoised", i.e., automatically parallelized, version of Delaunay mesh refinement (DR) on a shared-memory system with 128 CPUs. DR is an important irregular application used, e.g., in graphics and finite-element codes. The parallelized program scales to 64 threads, where it reaches a speedup of 25.8. At larger thread counts, performance is hampered by load imbalance and nonuniform memory latency, both of which grow with the number of threads. While these two issues remain to be addressed in future work, we believe our results already show the Galois approach to be very promising.
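To make the abstract's optimistic approach concrete, the sketch below illustrates the general pattern of optimistic parallel worklist execution: each thread speculatively claims a work item (in DR, a bad triangle), atomically locks the set of elements the operation would touch (the triangle's cavity), and rolls the item back onto the worklist on conflict. This is a minimal illustration of the technique, not the Galois API; all names (`OptimisticWorklist`, `neighborhood`, etc.) are hypothetical.

```python
import threading
from collections import deque

class OptimisticWorklist:
    """Toy optimistic worklist: all-or-nothing neighborhood locking with rollback."""

    def __init__(self, items, neighborhood):
        self.work = deque(items)
        self.work_lock = threading.Lock()
        self.neighborhood = neighborhood      # item -> set of elements it touches
        self.element_locks = {}               # element -> owning thread id
        self.state_lock = threading.Lock()
        self.processed = []                   # items successfully committed

    def try_acquire(self, item, tid):
        # Atomically lock every element in the item's neighborhood ("cavity").
        # If any element is owned by another thread, acquire nothing (conflict).
        with self.state_lock:
            elems = self.neighborhood[item]
            if any(self.element_locks.get(e) not in (None, tid) for e in elems):
                return False
            for e in elems:
                self.element_locks[e] = tid
            return True

    def release(self, item):
        with self.state_lock:
            for e in self.neighborhood[item]:
                self.element_locks.pop(e, None)

    def run(self, num_threads):
        def worker(tid):
            while True:
                with self.work_lock:
                    if not self.work:
                        return
                    item = self.work.popleft()
                if self.try_acquire(item, tid):
                    with self.state_lock:
                        self.processed.append(item)   # commit: "refine" the cavity
                    self.release(item)
                else:
                    with self.work_lock:              # conflict: roll back, retry later
                        self.work.append(item)
        threads = [threading.Thread(target=worker, args=(t,))
                   for t in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

Because acquisition is all-or-nothing under a single lock, the sketch cannot deadlock; conflicting items simply retry once the owning thread commits and releases, which mirrors the abort-and-retry behavior the abstract attributes to optimistic parallelization.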