Approximate weighted matching on emerging manycore and multithreaded architectures

Authors:
Mahantesh Halappanavar;John Feo;Oreste Villa;Antonino Tumeo;Alex Pothen
Affiliations:
Pacific Northwest National Laboratory, Richland, WA, USA;Pacific Northwest National Laboratory, Richland, WA, USA;Pacific Northwest National Laboratory, Richland, WA, USA;Pacific Northwest National Laboratory, Richland, WA, USA;Purdue University, West Lafayette, IN, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2012

Abstract

Graph matching is a prototypical combinatorial problem with many applications in high-performance scientific computing. Optimal algorithms for computing matchings are challenging to parallelize. Approximation algorithms are more amenable to parallelization and are therefore important for computing matchings on large-scale problems. They also generate nearly optimal solutions that are sufficient for many applications. In this paper we present multithreaded algorithms for computing half-approximate weighted matchings on state-of-the-art multicore (Intel Nehalem and AMD Magny-Cours), manycore (Nvidia Tesla and Nvidia Fermi), and massively multithreaded (Cray XMT) platforms. We provide two implementations: the first uses shared work queues and is suited to all platforms; the second, based on dataflow principles, exploits special features available on the Cray XMT. Using a carefully chosen dataset that exhibits characteristics from a wide range of applications, we show scalable performance across different platforms. In particular, for one instance of the input, an R-MAT graph (RMAT-G), we show speedups of about 32 on 48 cores of an AMD Magny-Cours, 7 on 8 cores of Intel Nehalem, 3 on Nvidia Tesla and 10 on Nvidia Fermi relative to one core of Intel Nehalem, and 60 on 128 processors of Cray XMT. We demonstrate strong as well as weak scaling for graphs with up to a billion edges using up to 12,800 threads. We avoid excessive fine-tuning for each platform and retain the basic structure of the algorithm uniformly across platforms. An exception is the dataflow algorithm designed specifically for the Cray XMT. To the best of the authors' knowledge, this is the first such large-scale study of the half-approximate weighted matching problem on multithreaded platforms.
Driven by the critical enabling role of combinatorial algorithms such as matching in scientific computing and the emergence of informatics applications, there is a growing demand to support irregular computations on current and future computing platforms. In this context, we evaluate the capability of emerging multithreaded platforms to tolerate the latency induced by irregular memory access patterns, and to support fine-grained parallelism via lightweight synchronization mechanisms. By contrasting the architectural features of these platforms with those of the Cray XMT, which is specifically designed to support irregular memory-intensive applications, we delineate the impact of these architectural choices on performance.
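
The abstract does not spell out the algorithm, so the following is only an illustrative, serial Python sketch of one well-known way to obtain a half-approximate weighted matching with a work queue: the locally dominant edge approach, in which an edge is matched when each endpoint is the other's heaviest still-unmatched neighbor. The function name half_approx_matching, the (u, v, w) edge-list input, and the tie-breaking rule (larger neighbor id wins among equal weights) are assumptions made for this example and are not taken from the paper; the parallel, GPU, and dataflow aspects are not modeled.

```python
from collections import deque


def half_approx_matching(n, edges):
    """Serial sketch of locally dominant half-approximate weighted matching.

    n     -- number of vertices, labeled 0..n-1 (assumed input format)
    edges -- iterable of (u, v, w) with u != v and weight w
    Returns a list `mate` where mate[u] is u's partner, or None if unmatched.
    """
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    mate = [None] * n       # current partner of each vertex
    candidate = [None] * n  # heaviest unmatched neighbor each vertex points to
    queue = deque()         # work queue of newly matched vertices

    def best_free_neighbor(u):
        # Heaviest unmatched neighbor; ties are broken by larger vertex id
        # so that every vertex ranks its incident edges consistently.
        best, best_key = None, None
        for v, w in adj[u]:
            if mate[v] is None and (best_key is None or (w, v) > best_key):
                best, best_key = v, (w, v)
        return best

    def try_match(u):
        # Match u with its candidate if the two vertices point at each other.
        v = candidate[u]
        if v is not None and candidate[v] == u:
            mate[u], mate[v] = v, u
            queue.append(u)
            queue.append(v)

    # Phase 1: every vertex points to its heaviest neighbor.
    for u in range(n):
        candidate[u] = best_free_neighbor(u)
    for u in range(n):
        if mate[u] is None:
            try_match(u)

    # Phase 2: unmatched neighbors that pointed at a newly matched vertex
    # pick a new candidate and retry.
    while queue:
        u = queue.popleft()
        for x, _ in adj[u]:
            if mate[x] is None and candidate[x] == u:
                candidate[x] = best_free_neighbor(x)
                try_match(x)
    return mate


# Path 0-1-2-3 with weights 2, 3, 2: the middle edge is locally dominant.
example_mate = half_approx_matching(4, [(0, 1, 2.0), (1, 2, 3.0), (2, 3, 2.0)])
```

On this small path the sketch matches only the middle edge (weight 3), while the optimal matching takes the two outer edges (weight 4), which is consistent with the guarantee of at least half the optimal weight. In a multithreaded setting the queue and the mate and candidate arrays would be shared and would need synchronization, which this serial sketch leaves out.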