Stochastic Grammatical Inference of Text Database Structure

  • Authors: Matthew Young-Lai; Frank Wm. Tompa

  • Affiliations: Computer Science Department, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. mdyounglai@neumann.uwaterloo.ca; Computer Science Department, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1

  • Venue: Machine Learning
  • Year: 2000

Abstract

For a document collection in which structural elements are identified with markup, it is often necessary to construct a grammar retrospectively that constrains element nesting and ordering. This has been addressed by others as an application of grammatical inference. We describe an approach based on stochastic grammatical inference which scales more naturally to large data sets and produces models with richer semantics. We adopt an algorithm that produces stochastic finite automata and describe modifications that enable better interactive control of results. Our experimental evaluation uses four document collections with varying structure.
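To make the idea of a stochastic finite automaton over element sequences concrete, here is a minimal sketch. It is not the authors' algorithm, which the abstract does not name. It assumes each training example is the ordered list of child element names found under one parent element, and it builds only a frequency prefix tree, a common starting point for state-merging inference methods. The names `build_prefix_tree`, `sequence_probability`, `END`, and the sample element names are hypothetical.

```python
from collections import defaultdict

END = "$"  # hypothetical end-of-sequence marker, assumed not to be an element name


def build_prefix_tree(sequences):
    """Build a frequency prefix tree from sequences of element names.

    Returns a mapping: state -> {symbol: [next_state, frequency]}.
    State 0 is the root; END edges record how often a sequence stops in a state.
    """
    counts = defaultdict(dict)
    next_id = 1
    for seq in sequences:
        state = 0
        for sym in seq:
            if sym not in counts[state]:
                counts[state][sym] = [next_id, 0]
                next_id += 1
            counts[state][sym][1] += 1
            state = counts[state][sym][0]
        # Record that one training sequence terminated in this state.
        if END not in counts[state]:
            counts[state][END] = [None, 0]
        counts[state][END][1] += 1
    return counts


def sequence_probability(counts, seq):
    """Probability of seq under relative-frequency estimates; 0.0 if unseen."""
    prob = 1.0
    state = 0
    for sym in list(seq) + [END]:
        edges = counts.get(state, {})
        if sym not in edges:
            return 0.0
        total = sum(freq for _, freq in edges.values())
        next_state, freq = edges[sym]
        prob *= freq / total
        state = next_state
    return prob


# Hypothetical child-element sequences observed under one parent element.
samples = [
    ["title", "para"],
    ["title", "para", "para"],
    ["title", "para"],
    ["title"],
]
model = build_prefix_tree(samples)
```

With these samples, `sequence_probability(model, ["title", "para"])` evaluates to 0.5, and a sequence never observed, such as `["para"]`, gets probability 0. An inference algorithm would go further and merge states whose outgoing frequencies look statistically similar, which is what lets the model generalize beyond the training sequences; that step is omitted here.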