The longest common subsequence (LCS) problem is an important problem in computer science with many applications, such as DNA sequence matching in bioengineering and file comparison in the UNIX diff utility. While much research has been devoted to finding efficient solutions to this problem, the emphasis has shifted with the advent of multi-core architectures toward multithreaded implementations. This paper applies supernode transformations to partition the dynamic programming solution of the LCS problem into multiple threads. We then enhance this method with a transformation matrix that skews the loop nest so that the loop-carried dependencies of the inner loop are eliminated within each supernode. We find that this technique performs well on microarchitectures supporting out-of-order execution, whereas in-order execution machines do not benefit from it. Furthermore, we present a variation of the supernode transformation and multithreading strategy that groups entire rows of the index set into a supernode, with inter-thread synchronization performed through an array of mutexes. We find that this scheme reduces thread-management overhead and improves data locality. A formula for the total execution time of each method is presented. The techniques are benchmarked on a 12-core and a four-core machine. On the 12-core machine, the traditional supernode transformation speeds up the original loop nest by a factor of 16.7. Our enhanced technique achieves a 42.6× speedup, and our new method achieves a 59.5× speedup. We observe super-linear speedup, as the performance gain exceeds the number of processing cores. The concepts presented and discussed in this paper for the LCS problem are generally applicable to algorithms with regular dependencies.
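For readers unfamiliar with the baseline being parallelized, the dynamic programming recurrence for the LCS problem can be sketched as below. This is a minimal sequential sketch, not the paper's multithreaded implementation; the function name `lcs_length` and the use of a variable-length array are illustrative choices. The comments point out the north, west, and north-west cell dependencies, which are the uniform (regular) dependencies that the paper's supernode transformations partition into coarse-grained blocks for multithreading.

```c
#include <string.h>

/* Classic O(m*n) dynamic-programming solution of the LCS problem.
 * c[i][j] holds the length of the LCS of x[0..i-1] and y[0..j-1].
 * Each interior cell depends on its north (c[i-1][j]), west
 * (c[i][j-1]), and north-west (c[i-1][j-1]) neighbours -- the
 * uniform loop-carried dependencies that a supernode (tiling)
 * transformation groups into blocks executed by separate threads. */
int lcs_length(const char *x, const char *y) {
    int m = (int)strlen(x), n = (int)strlen(y);
    /* VLA for brevity; a production version would allocate on the
       heap and could keep only two rows to reduce memory use. */
    int c[m + 1][n + 1];
    for (int i = 0; i <= m; i++) {
        for (int j = 0; j <= n; j++) {
            if (i == 0 || j == 0)
                c[i][j] = 0;                      /* empty prefix */
            else if (x[i - 1] == y[j - 1])
                c[i][j] = c[i - 1][j - 1] + 1;    /* north-west dep. */
            else
                c[i][j] = c[i - 1][j] > c[i][j - 1]
                        ? c[i - 1][j]             /* north dep. */
                        : c[i][j - 1];            /* west dep. */
        }
    }
    return c[m][n];
}
```

Because every cell in row i depends only on row i-1 and on the cell to its left, a thread assigned a block of rows can start as soon as the thread above it has made sufficient column progress, which is the property the row-grouped supernode variant with an array of mutexes exploits.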