Learning regular expressions from noisy sequences

  • Authors:
  • Ugo Galassi;Attilio Giordana

  • Affiliations:
  • Dipartimento di Informatica, Università Amedeo Avogadro, Alessandria, Italy;Dipartimento di Informatica, Università Amedeo Avogadro, Alessandria, Italy

  • Venue:
  • SARA'05 Proceedings of the 6th international conference on Abstraction, Reformulation and Approximation
  • Year:
  • 2005

Abstract

The presence of long gaps dramatically increases the difficulty of detecting and characterizing complex events hidden in long sequences. To cope with this problem, a learning algorithm based on an abstraction mechanism is proposed: it can infer the general model of complex events from a set of learning sequences. Events are described by means of regular expressions, and the abstraction mechanism is based on the substitution property of regular languages. The induction algorithm proceeds bottom-up, progressively coarsening the sequence granularity and letting correlations between subsequences separated by long gaps emerge naturally. Two abstraction operators are defined. The first detects regular expressions that contain no iterative constructs and abstracts them into non-terminal symbols. The second detects and abstracts iterated subsequences. By interleaving the two operators, regular expressions in general form may be inferred. Both operators are based on string alignment algorithms taken from bioinformatics. A restricted form of the algorithm has already been outlined in previous papers, where the emphasis was on applications. Here, an extended version of the algorithm is described and analyzed in detail.
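To make the idea of abstracting iterated subsequences concrete, the following is a minimal toy sketch in Python. It is not the authors' algorithm: the abstract states that the operators rely on string alignment, which tolerates noise, whereas this sketch uses exact matching of consecutive repeats as a deliberate simplification. The function name `abstract_repeats`, its parameters, and the `("+", unit)` token format are all hypothetical, chosen only for illustration.

```python
def abstract_repeats(seq, min_len=1, max_len=4, min_count=2):
    """Collapse exact consecutive repeats into a single abstract token.

    Assumption: exact matching stands in for the alignment-based
    detection mentioned in the abstract. Each run of at least
    `min_count` copies of a unit is replaced by ("+", unit), a
    stand-in for the regular expression (unit)+.
    """
    out = []
    i = 0
    n = len(seq)
    while i < n:
        best = None  # (covered_length, unit_length)
        for k in range(min_len, max_len + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # not enough symbols left for a unit of length k
            count = 1
            # Count how many copies of `unit` follow each other from position i.
            while seq[i + count * k: i + (count + 1) * k] == unit:
                count += 1
            covered = count * k
            if count >= min_count and (best is None or covered > best[0]):
                best = (covered, k)
        if best is None:
            out.append(seq[i])  # no repeat starts here: keep the symbol
            i += 1
        else:
            covered, k = best
            out.append(("+", tuple(seq[i:i + k])))  # abstract the whole run
            i += covered
    return out


print(abstract_repeats(list("gabababt")))
# ['g', ('+', ('a', 'b')), 't']
```

In this toy setting the eight-symbol input shrinks to three symbols, which illustrates, under the stated assumptions, how replacing detected patterns with new symbols coarsens the sequence and shortens the distance between the remaining symbols.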