General suffix automaton construction algorithm and space bounds

Authors:
Mehryar Mohri;Pedro Moreno;Eugene Weinstein
Affiliations:
Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012, United States and Google Research, 76 Ninth Avenue, New York, NY 10011, United States;Google Research, 76 Ninth Avenue, New York, NY 10011, United States;Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012, United States and Google Research, 76 Ninth Avenue, New York, NY 10011, United States
Venue:
Theoretical Computer Science
Year:
2009

References (9):
Transducers and repetitions. Theoretical Computer Science
Complete inverted files for efficient text retrieval and analysis. Journal of the ACM (JACM)
Minimisation of acyclic deterministic automata in linear time. Theoretical Computer Science - Selected papers of the Combinatorial Pattern Matching School
Algorithms on strings, trees, and sequences: computer science and computational biology
Efficient Experimental String Matching by Weak Factor Recognition. CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Algorithms for the directed acyclic word graph and related structures (data structures, suffix trees, inverted file, automata, string algorithms)
Finite-state transducers in language and speech processing. Computational Linguistics
On-line construction of compact directed acyclic word graphs. Discrete Applied Mathematics - 12th annual symposium on combinatorial pattern matching (CPM)
General indexation of weighted automata: application to spoken utterance retrieval. SpeechIR '04 Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL 2004

Cited by (1):
Efficient and robust music identification with weighted finite-state transducers. IEEE Transactions on Audio, Speech, and Language Processing

Abstract

Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2Q-2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. This bound significantly improves over 2‖U‖-1, the bound given by Blumer et al. [A. Blumer, J. Blumer, D. Haussler, R.M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578-589], where ‖U‖ is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|). Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
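
For orientation, the standard on-line construction for a single input string, which the abstract says the paper's algorithm generalizes, can be sketched as follows. This is a minimal illustration of the classical single-string method, not the paper's general algorithm for sets of strings or suffix-unique automata. The names State, build_suffix_automaton, and accepts_factor are hypothetical and chosen only for this example. Ignoring final-state marking, the structure recognizes exactly the factors (substrings) of the input, and for an input of length n >= 2 it has at most 2n-1 states.

```python
class State:
    """One automaton state: outgoing transitions, suffix link, longest length."""
    __slots__ = ("next", "link", "length")

    def __init__(self, length=0, link=-1):
        self.next = {}        # symbol -> index of target state
        self.link = link      # suffix link (index), -1 for the initial state
        self.length = length  # length of the longest string reaching this state


def build_suffix_automaton(text):
    """Classical on-line construction for a single string (illustrative sketch)."""
    states = [State()]  # state 0 is the initial state
    last = 0            # state reached by the whole prefix read so far
    for ch in text:
        cur = len(states)
        states.append(State(length=states[last].length + 1))
        p = last
        # Add the new transition along the suffix-link chain where it is missing.
        while p != -1 and ch not in states[p].next:
            states[p].next[ch] = cur
            p = states[p].link
        if p == -1:
            states[cur].link = 0
        else:
            q = states[p].next[ch]
            if states[p].length + 1 == states[q].length:
                states[cur].link = q
            else:
                # Split q: the clone keeps q's transitions but a shorter length.
                clone = len(states)
                copy = State(length=states[p].length + 1, link=states[q].link)
                copy.next = dict(states[q].next)
                states.append(copy)
                while p != -1 and states[p].next.get(ch) == q:
                    states[p].next[ch] = clone
                    p = states[p].link
                states[q].link = clone
                states[cur].link = clone
        last = cur
    return states


def accepts_factor(states, word):
    """Return True if word is a substring of the text the automaton was built from."""
    s = 0
    for ch in word:
        if ch not in states[s].next:
            return False
        s = states[s].next[ch]
    return True
```

A call such as accepts_factor(build_suffix_automaton("abcbc"), "cbc") follows one transition per symbol, so a factor query takes time proportional to the length of the query word.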