Parallel blocked algorithm for solving the algebraic path problem on a matrix processor

Authors:
Akihito Takahashi;Stanislav Sedukhin
Affiliations:
Graduate School of Computer Science and Engineering, University of Aizu, Aizuwakamatsu City, Fukushima, Japan;Graduate School of Computer Science and Engineering, University of Aizu, Aizuwakamatsu City, Fukushima, Japan
Venue:
HPCC'05 Proceedings of the First International Conference on High Performance Computing and Communications
Year:
2005

Abstract

This paper presents a parallel blocked algorithm for the algebraic path problem (APP). The APP is known to have the same complexity as classical matrix-matrix multiplication; however, solving the APP takes much longer because its unique data dependencies drastically limit data reuse. We examine a parallel implementation of a blocked algorithm for the APP on the one-chip Intrinsity FastMATH adaptive processor, which consists of a scalar MIPS processor extended with a SIMD matrix coprocessor. The matrix coprocessor supports native matrix instructions on an array of 4 × 4 processing elements. Implementing with matrix instructions requires recasting algorithms in terms of matrix-matrix operations. Conventional vectorization for SIMD vector processing deals only with the innermost loop; on the FastMATH processor, however, two or three nested loops must be vectorized so that they can be converted into a single equivalent matrix operation. Our experimental results show a peak performance of 9.27 GOPS and high usage rates of matrix instructions for solving the APP. These results indicate that a SIMD matrix extension to a (super)scalar processor would be very useful for the fast solution of many matrix-formulated problems.
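
To make the idea of a blocked APP algorithm concrete, the following is a minimal sketch in Python of one well-known instance of the APP, all-pairs shortest paths over the (min, +) semiring. It is not the authors' FastMATH implementation: the function names (floyd_warshall, blocked_floyd_warshall, _update), the choice of the shortest-path instance, and the default block size of 4 (picked only to echo the 4 × 4 processing-element array mentioned above) are assumptions made for illustration. The sketch shows how the three nested loops collapse into a single block-level "multiply-accumulate" kernel, and how the order of block updates within each round reflects the data dependencies that limit reuse.

```python
import math

INF = math.inf  # "no edge" marker; the identity for min


def floyd_warshall(d):
    """Reference (unblocked) all-pairs shortest paths; returns a new matrix."""
    n = len(d)
    d = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d


def _update(d, ci, cj, ai, aj, bi, bj, b):
    """Block kernel: C = min(C, A (min,+) B) on b x b blocks of d.

    (ci, cj), (ai, aj), (bi, bj) are the top-left corners of blocks C, A, B.
    The k loop is outermost because C may alias A and/or B.
    """
    for k in range(b):
        for i in range(b):
            for j in range(b):
                via = d[ai + i][aj + k] + d[bi + k][bj + j]
                if via < d[ci + i][cj + j]:
                    d[ci + i][cj + j] = via


def blocked_floyd_warshall(d, b=4):
    """Blocked all-pairs shortest paths; assumes len(d) is a multiple of b."""
    n = len(d)
    assert n % b == 0, "matrix size must be a multiple of the block size"
    d = [row[:] for row in d]
    starts = range(0, n, b)
    for k in starts:
        # Phase 1: the diagonal block depends only on itself.
        _update(d, k, k, k, k, k, k, b)
        # Phase 2: blocks in the same block-row and block-column as the
        # diagonal block depend on the updated diagonal block.
        for j in starts:
            if j != k:
                _update(d, k, j, k, k, k, j, b)
        for i in starts:
            if i != k:
                _update(d, i, k, i, k, k, k, b)
        # Phase 3: all remaining blocks depend on one block from the
        # block-column and one from the block-row updated in phase 2.
        for i in starts:
            for j in starts:
                if i != k and j != k:
                    _update(d, i, j, i, k, k, j, b)
    return d
```

In this sketch only phase 3 resembles an ordinary matrix-matrix multiply-accumulate with independent operands; phases 1 and 2 update a block that is also one of its own inputs, which is one way to see why the APP reuses data less freely than plain matrix multiplication. Calling blocked_floyd_warshall(g, 4) on a square weight matrix g (with 0 on the diagonal and INF for missing edges) should return the same result as floyd_warshall(g).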