We present the design of an algorithm for constructing the suffix array of a string on manycore GPUs. Despite the wide use of suffix arrays in text processing and extensive research over two decades, there has been a lack of efficient algorithms able to exploit shared-memory parallelism (on multicore CPUs as well as manycore GPUs) in practice. To the best of our knowledge, we have developed the first approach exposing shared-memory parallelism that significantly outperforms state-of-the-art implementations for sufficiently large inputs. We reduce the suffix array construction problem to a number of parallel primitives, such as prefix sum, radix sort, and random gather from and scatter to memory. Thus, the performance of the algorithm depends only on the performance of these primitives on the particular shared-memory architecture. We demonstrate its performance on manycore GPUs, but the method can also be applied to other parallel architectures, such as multicores, CELL, or Intel MIC.
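As a rough illustration of what reducing suffix array construction to such primitives can look like, the following sequential Python sketch uses classic prefix doubling. This is an assumption made for illustration only: the abstract does not state which construction scheme the algorithm is based on, and the helper names `gather`, `scatter`, and `suffix_array` are hypothetical. Each round consists solely of a sort on key pairs (a stand-in for a parallel radix sort), a gather, an inclusive prefix sum, and a scatter, so each step maps onto a data-parallel primitive.

```python
from itertools import accumulate


def gather(src, idx):
    """Gather primitive: out[j] = src[idx[j]]."""
    return [src[i] for i in idx]


def scatter(values, idx, size):
    """Scatter primitive: out[idx[j]] = values[j]."""
    out = [0] * size
    for v, i in zip(values, idx):
        out[i] = v
    return out


def suffix_array(text):
    """Prefix-doubling suffix array built only from sort, gather,
    prefix sum, and scatter (illustrative, not the paper's algorithm)."""
    n = len(text)
    if n == 0:
        return []
    # Initial ranks come from single characters; rank 0 is reserved
    # for positions past the end of the string.
    rank = [ord(c) + 1 for c in text]
    h = 1
    while True:
        # Key of suffix i: (rank of first h chars, rank of next h chars).
        second = [rank[i + h] if i + h < n else 0 for i in range(n)]
        keys = list(zip(rank, second))
        # Stand-in for a parallel radix sort on the key pairs.
        sa = sorted(range(n), key=keys.__getitem__)
        sorted_keys = gather(keys, sa)
        # Flag positions where the key changes, then take an inclusive
        # prefix sum to obtain new ranks 1..k in sorted order.
        flags = [1] + [int(sorted_keys[j] != sorted_keys[j - 1])
                       for j in range(1, n)]
        new_rank_sorted = list(accumulate(flags))
        # Scatter the new ranks back to text order.
        rank = scatter(new_rank_sorted, sa, n)
        if new_rank_sorted[-1] == n:  # all suffixes are distinguished
            return sa
        h *= 2


print(suffix_array("banana"))
```

On a GPU, each of these steps would be replaced by the corresponding library primitive; the sketch only shows that the control flow around them is trivial, which is why overall performance tracks the performance of the primitives.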