Efficient computation of substring equivalence classes with suffix arrays

Authors:
Kazuyuki Narisawa;Shunsuke Inenaga;Hideo Bannai;Masayuki Takeda
Affiliations:
Department of Informatics, Kyushu University, Fukuoka, Japan;Department of Computer Science and Communication Engineering, Kyushu University, Fukuoka, Japan;Department of Informatics, Kyushu University, Fukuoka, Japan;Department of Informatics, Kyushu University, Fukuoka, Japan and SORST, Japan Science and Technology Agency
Venue:
CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching
Year:
2007

References (13)

Complete inverted files for efficient text retrieval and analysis
Journal of the ACM (JACM)

Suffix arrays: a new method for on-line string searches
SIAM Journal on Computing

A Space-Economical Suffix Tree Construction Algorithm
Journal of the ACM (JACM)

Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching

Discovering characteristic expressions in literary works
Theoretical Computer Science

A Corpus for the Evaluation of Lossless Compression Algorithms
DCC '97 Proceedings of the Conference on Data Compression

Protein Is Incompressible
DCC '99 Proceedings of the Conference on Data Compression

Replacing suffix trees with enhanced suffix arrays
Journal of Discrete Algorithms - SPIRE 2002

Linear pattern matching algorithms
SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)

On-line construction of compact directed acyclic word graphs
Discrete Applied Mathematics

Linear-time construction of suffix arrays
CPM'03 Proceedings of the 14th annual conference on Combinatorial pattern matching

Space efficient linear time construction of suffix arrays
CPM'03 Proceedings of the 14th annual conference on Combinatorial pattern matching

Simple linear work suffix array construction
ICALP'03 Proceedings of the 30th international conference on Automata, languages and programming

Cited by (4)

Unsupervised spam detection based on string alienness measures
DS'07 Proceedings of the 10th international conference on Discovery science

Minimum Unique Substrings and Maximum Repeats
Fundamenta Informaticae - Theory that Counts: To Oscar Ibarra on His 70th Birthday

Computing regularities in strings: A survey
European Journal of Combinatorics

Space-Efficient computation of maximal and supermaximal repeats in genome sequences
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval

Abstract

This paper considers the enumeration of the substring equivalence classes introduced by Blumer et al. [1], who used them to define an index structure called the compact directed acyclic word graph (CDAWG). In text analysis, these equivalence classes are useful because they group together redundant substrings with essentially identical occurrences. We show how to enumerate the equivalence classes using suffix arrays. Our algorithm uses the rank and lcp arrays to traverse the corresponding suffix tree and needs no other auxiliary data structure. It runs in time linear in the length of the input string. We report experimental results comparing the running time and space consumption of our algorithm with those of suffix tree and CDAWG based approaches.
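
To make the abstract's idea concrete, the following is a minimal Python sketch, not the authors' algorithm. It builds a suffix array with a naive sort, derives the rank array and the lcp array (using the linear-time lcp computation of the kind listed in the references), and then walks the lcp array with a stack to visit the internal nodes of the implied suffix tree. It assumes the usual characterization in which the longest member of each equivalence class that occurs more than once is a maximal repeat, and it reports those representatives with their occurrence counts. All function names are hypothetical, and the suffix array construction and the left-extension check are deliberately simple rather than linear time.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; for illustration only,
    # not one of the linear-time constructions.
    return sorted(range(len(text)), key=lambda i: text[i:])


def rank_array(sa):
    # rank[i] is the position of suffix i in the suffix array.
    rank = [0] * len(sa)
    for r, start in enumerate(sa):
        rank[start] = r
    return rank


def lcp_array(text, sa, rank):
    # lcp[r] = length of the longest common prefix of the suffixes
    # sa[r - 1] and sa[r]; lcp[0] = 0. Processes suffixes in text order
    # so that the running match length h drops by at most one per step.
    n = len(text)
    lcp = [0] * n
    h = 0
    for i in range(n):
        r = rank[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return lcp


def lcp_intervals(lcp):
    # Bottom-up traversal of the implied suffix tree using a stack.
    # Returns (depth, left, right) for every internal node of depth >= 1,
    # where sa[left..right] (inclusive) are the suffixes below the node.
    n = len(lcp)
    found = []
    stack = [(0, 0)]  # (string depth, left boundary); the root stays put
    for r in range(1, n + 1):
        cur = lcp[r] if r < n else 0
        left = r - 1
        while stack[-1][0] > cur:
            depth, left = stack.pop()
            found.append((depth, left, r - 1))
        if stack[-1][0] < cur:
            stack.append((cur, left))
    return found


def maximal_repeats(text):
    # An internal node is kept when its occurrences are preceded by at
    # least two different characters (or one occurrence starts the text).
    # This check is naive and can take quadratic time overall.
    sa = suffix_array(text)
    lcp = lcp_array(text, sa, rank_array(sa))
    result = []
    for depth, left, right in lcp_intervals(lcp):
        starts = sa[left:right + 1]
        left_chars = {text[s - 1] for s in starts if s > 0}
        if 0 in starts or len(left_chars) > 1:
            result.append((text[starts[0]:starts[0] + depth], len(starts)))
    return sorted(result)


print(maximal_repeats("banana"))  # → [('a', 3), ('ana', 2)]
```

For "banana", the substrings "n", "an", "na" and "ana" all occur at essentially the same two places, so they fall into one class represented by "ana", while "a" forms a class of its own with three occurrences.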