Factor automata of automata and applications

  • Authors:
  • Mehryar Mohri; Pedro Moreno; Eugene Weinstein

  • Affiliations:
  • Courant Institute of Mathematical Sciences, New York, NY and Google Research, New York, NY; Google Research, New York, NY; Courant Institute of Mathematical Sciences, New York, NY and Google Research, New York, NY

  • Venue:
  • CIAA'07: Proceedings of the 12th International Conference on Implementation and Application of Automata
  • Year:
  • 2007

Abstract

An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors (substrings) of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is, the minimal deterministic automaton accepting the set of factors of a finite set of strings that is itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where |Q| is the number of nodes of a prefix tree representing the strings in U. This bound significantly improves over 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. The paper also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that factor automata of automata can be practical for large-scale applications, a conclusion further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
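To make the quantities in the two bounds concrete, the following is a minimal brute-force sketch in Python, not the construction used in the paper. It assumes a small set U of distinct strings and counts the states of the minimal deterministic factor automaton by enumerating distinct right languages of factors (the Myhill-Nerode characterization), which is only feasible for tiny inputs. All function names are hypothetical and chosen for illustration.

```python
def factors(strings):
    """Return every factor (substring) of every string, including the empty one."""
    out = set()
    for s in strings:
        for i in range(len(s) + 1):
            for j in range(i, len(s) + 1):
                out.add(s[i:j])
    return out


def minimal_factor_dfa_states(strings):
    """Count states of the minimal DFA (without a dead state) accepting all factors.

    Two factors lead to the same state exactly when they have the same
    right language {y : xy is a factor}. Brute force, for small inputs only.
    """
    fac = factors(strings)
    right_languages = set()
    for x in fac:
        right_languages.add(frozenset(f[len(x):] for f in fac if f.startswith(x)))
    return len(right_languages)


def prefix_tree_nodes(strings):
    """|Q|: number of prefix-tree nodes, i.e. distinct prefixes including the empty one."""
    return len({s[:i] for s in strings for i in range(len(s) + 1)})


def total_length(strings):
    """||U||: sum of the lengths of all strings (assumed distinct)."""
    return sum(len(s) for s in strings)


if __name__ == "__main__":
    U = ["ab", "ac"]  # toy example, not from the paper
    q = prefix_tree_nodes(U)
    n = total_length(U)
    print("states in minimal factor automaton:", minimal_factor_dfa_states(U))
    print("prefix-tree bound 2|Q| - 2:", 2 * q - 2)
    print("Blumer et al. bound 2||U|| - 1:", 2 * n - 1)
```

The prefix-tree bound becomes smaller relative to the length-based bound as the strings in U share longer prefixes, since shared prefixes are counted once in |Q| but once per string in ||U||.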