Factor automata of automata and applications

Authors:
Mehryar Mohri; Pedro Moreno; Eugene Weinstein
Affiliations:
Courant Institute of Mathematical Sciences, New York, NY and Google Research, New York, NY; Google Research, New York, NY; Courant Institute of Mathematical Sciences, New York, NY and Google Research, New York, NY
Venue:
CIAA'07 Proceedings of the 12th international conference on Implementation and application of automata
Year:
2007

References:
Transducers and repetitions. Theoretical Computer Science.
Complete inverted files for efficient text retrieval and analysis. Journal of the ACM (JACM).
Algorithms on strings, trees, and sequences: computer science and computational biology.
Finite-state transducers in language and speech processing. Computational Linguistics.
General indexation of weighted automata: application to spoken utterance retrieval. SpeechIR '04 Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL 2004.

Cited by:
Efficient and robust music identification with weighted finite-state transducers. IEEE Transactions on Audio, Speech, and Language Processing.


Abstract

An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is, the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where |Q| is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
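
To make the two bounds in the abstract concrete, the following is a minimal sketch, not the paper's construction. It enumerates the factors of a small string set naively and compares 2|Q| - 2 with 2||U|| - 1. The function names are hypothetical, and the sketch assumes that |Q| counts every node of the prefix tree, including the root.

```python
def factors(strings):
    """Return the set of all factors (substrings) of the given strings.

    Naive enumeration for illustration only; a factor automaton
    represents this set compactly instead of listing it.
    """
    result = {""}
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                result.add(s[i:j])
    return result


def prefix_tree_size(strings):
    """Count the nodes of the prefix tree of `strings`.

    Assumption: the root is counted as a node.
    """
    root = {}
    count = 1  # the root
    for s in strings:
        node = root
        for ch in s:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


def state_bounds(strings):
    """Return (2|Q| - 2, 2||U|| - 1) for the string set U = `strings`."""
    q = prefix_tree_size(strings)
    total_length = sum(len(s) for s in strings)
    return 2 * q - 2, 2 * total_length - 1


if __name__ == "__main__":
    u = ["abc", "abd", "abe"]
    print(sorted(factors(u)))
    print(state_bounds(u))
```

For strings that share long prefixes, the prefix tree has far fewer nodes than the total length ||U||, which is where the first bound becomes noticeably tighter than the second.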